In This Issue


This issue of Survey Methodology includes papers covering a variety of methodological subjects such as 
modeling and estimation, weighting and variance estimation, non-response and sampling. 

In the first paper of the issue, Skinner and Vieira investigate the effect of clustered sampling on variance 
estimation in longitudinal surveys. They present theoretical arguments and empirical evidence of the effects 
of ignoring clustering in longitudinal analyses, and find that these effects tend to be larger than for 
corresponding cross-sectional analyses. They also compare traditional survey sampling based methods to 
account for clustering in variance estimation to a multi-level modeling approach. 

Kovačević and Roberts compare three models for analyzing multiple spells arising from data collected
through longitudinal surveys with complex survey designs, which can involve stratification and clustering. 
These models are variations of the Cox proportional hazards model along the same lines as those proposed 
in the literature by Lin and Wei (1989), Binder (1992) and Lin (2000). These three models are compared 
using data from Statistics Canada’s Survey on Labor and Income Dynamics (SLID). This paper gives new 
insight into fitting Cox models to survey data containing multiple spells per individual, a situation that 
arises quite frequently. The paper also illustrates some of the challenges in fitting Cox models to survey 
data. 

Elliott, in his paper, presents a method for balancing elevated variance due to extreme weights with 
potential bias using a Bayesian weight trimming method in generalized linear models. This is accomplished
by using a stratified hierarchical Bayesian model in which strata are determined by the probabilities of 
inclusion or survey weights. He illustrates and evaluates the approach using simulations based on linear and 
logistic regression models, and an application using data from the Partners for Child Passenger Safety 
dataset. 

The paper by Breidt, Opsomer, Johnson and Ranalli explores the use of semiparametric methods for the 
estimation of population means. In semiparametric estimation, some variables are assumed to be linearly 
related to the variable of interest while the other variables may have a complicated, unspecified relation to 
the variable of interest. The authors study theoretically the properties under the sampling design of the 
resulting estimators. In particular, they show the design-consistency and the asymptotic normality of their 
estimator. Their method is then applied to data from a survey of lakes in the northeastern United States. 

Tanguay and Lavallée address the problem of estimating the depreciation of assets based on a database 
of price ratios. In their paper, the issue is that the ratios do not come from a random sample from the 
population of ratios. The authors argue that the distribution of ratios should converge to a Uniform 
distribution and propose a weighting scheme that will make the weighted empirical distribution function 
approximately uniform. The proposed method is illustrated by an example using data on the depreciation of 
automobiles. 

Steel and Clark present an empirical and theoretical comparison of person-level generalized regression 
survey weights and integrated household-level weights in the case of a simple random sample of 
households in which all household members are selected. They conclude that there is little or no loss in
efficiency associated with integrated weighting. 

Saigo, in his paper, proposes a bootstrap variance estimation procedure for two-phase designs with high 
sampling fractions. The method uses common bootstrap techniques, but adjusts the values of the auxiliary 
variables for units that are selected in the first phase sample only. The proposed technique is illustrated 
using several commonly used estimators such as the ratio estimator, and estimators of the distribution 
function and quantiles. Results from a simulation study comparing the proposed method to several others 
are presented. 




In the paper by Longford the problem of estimating the MSE of small area estimates is investigated. A 
composite estimator of the MSE of small area means is obtained by combining a model-based variance 
estimator and a naive estimator of the MSE. The coefficient that combines the two estimators minimizes 
the expected MSE of the resulting composite estimator of the MSE. The proposed estimator is compared 
with existing estimators through several simulation studies. 

Shao considers the problem of imputing for missing values when the nonresponse is nonignorable. In 
the situation where the nonresponse depends on a cluster level random effect, he shows that the commonly 
used mean imputed estimator is biased unless the mean of the cluster is used. For variance estimation, a 
jackknife variance estimation procedure for the proposed estimator is provided. The proposed estimator is 
compared with the mean imputed estimator by means of a simulation study. 

In the final paper of this issue, Tiwari, Nigam and Pant make use of the idea of nearest proportional to 
size sampling designs to obtain optimal controlled sample designs where non-preferred samples have zero 
selection probabilities. The optimal controlled sample designs are obtained by combining an initial
inclusion probability proportional to size design and quadratic programming techniques to ensure that
non-preferred samples have a zero selection probability. Their method is illustrated using several examples.


Harold Mantel, Deputy Editor

Variance estimation in the analysis of clustered
longitudinal survey data 


Chris Skinner and Marcel de Toledo Vieira ¹


Abstract 


We investigate the impact of cluster sampling on standard errors in the analysis of longitudinal survey data. We consider a 
widely used class of regression models for longitudinal data and a standard class of point estimators of a generalized least 
squares type. We argue theoretically that the impact of ignoring clustering in standard error estimation will tend to increase 
with the number of waves in the analysis, under some patterns of clustering which are realistic for many social surveys. The 
implication is that it is, in general, at least as important to allow for clustering in standard errors for longitudinal analyses as 
for cross-sectional analyses. We illustrate this theoretical argument with empirical evidence from a regression analysis of 
longitudinal data on gender role attitudes from the British Household Panel Survey. We also compare two approaches to 
variance estimation in the analysis of longitudinal survey data: a survey sampling approach based upon linearization and a 
multilevel modelling approach. We conclude that the impact of clustering can be seriously underestimated if it is simply 
handled by including an additive random effect to represent the clustering in a multilevel model. 


Key Words: Clustering; Design effect; Misspecification effect; Multilevel model. 


1. Introduction 


It is well known that it is important to take account of 
sample clustering when estimating standard errors in the 
analysis of survey data. Otherwise, standard error estimators 
can be severely biased. In this paper we investigate the 
impact of clustering in the regression analysis of 
longitudinal survey data and compare it with the impact on 
corresponding cross-sectional analyses. Kish and Frankel 
(1974) presented empirical work which suggested that the 
impact of complex designs on variances decreases for more
complex analytical statistics and so one might conjecture 
that the impact on longitudinal analyses might also be 
reduced. We shall argue that, in fact, the impact of 
clustering on longitudinal analyses can tend to be greater, at 
least for a number of common types of analysis and for 
some common practical settings. An intuitive explanation is 
that some common forms of longitudinal analysis of 
individual survey data ‘pool’ data over time and enable 
much temporal ‘random’ variation in individual responses 
to be ‘extracted’ in the estimation of regression coefficients. 
In contrast, it may only be possible to extract much less 
variation in the effects of clustering since such clustering, 
representing geography for example, often tends to generate 
more stable effects than repeated measurements of 
individual behaviour. As a consequence the relative 
importance of clustering in standard errors can increase the 
more waves of data are included in the analysis. 

In addition to considering the impact of clustering on 
variance estimation, we shall also consider the question of 
how to undertake the variance estimation itself. It is natural 
for many analysts to represent clustering via multilevel
models (Goldstein 2003, Chapter 9; Renard and Molenberghs
2002) and we shall consider how variance estimation 
methods based upon such models compare with survey 
sampling variance estimation procedures in the case of cluster 
sampling. 

There is a well established literature on methods for 
taking account of complex sampling schemes in the 
regression analysis of survey data. See, e.g., Kish and
Frankel (1974), Fuller (1975), Binder (1983), Skinner, Holt
and Smith (1989) and Chambers and Skinner (2003). We
restrict attention here to ‘aggregate’ regression analyses
(Skinner et al. 1989), where regression coefficients at the
‘population level’ are the parameters of interest, where 
suitable estimates of these coefficients may be obtained by 
adapting standard model-based procedures using survey 
weights and where the variances of these estimated 
regression coefficients may be estimated by linearization 
methods (Kish and Frankel 1974; Fuller 1975). In this 
paper, we extend this work to the case when longitudinal 
survey observations are obtained, based upon an initial 
sample drawn according to a complex sampling scheme, 
focussing again on the case of a clustered design. We 
consider a standard class of linear regression models for 
such longitudinal data, as considered in the biostatistical 
literature (e.g., Diggle, Heagerty, Liang and Zeger 2002), 
the multilevel modelling literature (e.g., Goldstein 2003) 
and the econometric literature (e.g., Baltagi 2001). We 
consider an established class of point estimators of a 
generalized least squares type, modified by survey 
weighting. For some applications of such methods to survey 
data, see Lavange, Koch and Schwartz (2001); Lavange, 
Stearns, Lafata, Koch and Shah (1996). 


1. Chris Skinner, University of Southampton, United Kingdom; Marcel de Toledo Vieira, Universidade Federal de Juiz de Fora, Brazil. 


The impact of a complex sampling scheme on variance 
estimation will be measured by the ‘misspecification effect’, 
denoted meff (Skinner 1989a), which is the variance of the 
point estimator of interest under the actual sampling scheme 
divided by the expectation of a specified variance estimator. 
This is a measure of the relative bias of the specified 
variance estimator. If it is unbiased then the meff will be 
one. If the actual sampling scheme involves clustering but 
the specified variance estimator is ‘misspecified’ by 
ignoring the clustering, then the expectation of the variance 
estimator will usually be less than the actual variance and 
the meff will be greater than one. This concept is closely 
related to that of the ‘design effect’ or deff of Kish (1965), 
defined as the variance of the point estimator under the 
given design divided by its variance under simple random 
sampling with the same sample size, a concept more 
relevant to the choice of design than to the choice of 
standard error estimator. 

We shall illustrate our theoretical arguments with 
analyses of data from the British Household Panel Survey 
(BHPS) on attitudes to gender roles, where the units of 
primary analytic interest are individual women and the 
clusters consist of postcode sectors, used as primary 
sampling units in the selection of the first wave sample from 
an address register. 

The framework, including the models and estimation 
methods, is described in Section 2. The theoretical 
properties of the variance estimation methods are considered 
in Section 3. Section 4 illustrates these properties numerically,
using an analysis of BHPS data. Some concluding
remarks are provided in Section 5. 


2. Regression model, data and 
inference procedures 


Consider a finite population $U = \{1, \ldots, N\}$ of $N$ units,
assumed fixed across a series of occasions $t = 1, \ldots, T$. We
shall refer to the units as individuals, although our
discussion is applicable more generally. Let $y_{it}$ denote the
value of an outcome variable for individual $i \in U$ at
occasion $t$ and let $y_i = (y_{i1}, \ldots, y_{iT})'$ be the vector of
repeated measurements. Let $x_{it}$ denote a corresponding
$1 \times q$ vector of values of covariates for individual $i$ at
occasion $t$ and let $x_i = (x_{i1}', \ldots, x_{iT}')'$. We assume that the
following linear model holds for the expectation of $y_i$
conditional on $(x_1, \ldots, x_N)$:


$$E(y_i) = x_i \beta, \qquad (1)$$


where $\beta$ is a $q \times 1$ vector of regression coefficients and the
expectation is with respect to the model. We suppose that $\beta$
is the target for inference, that is the regression coefficients
are the parameters of primary interest to the analyst.


Statistics Canada, Catalogue No. 12-001-XPB 




Although we shall consider further features of this model,
such as the covariance matrix of $y_i$, these will be assumed
to be of secondary interest to the analyst.

The data available to make inference about $\beta$ are from a
longitudinal survey in which values of $y_{it}$ and $x_{it}$ are
observed at each occasion (wave) $t = 1, \ldots, T$ for individuals
$i$ in a sample, $s$, drawn from $U$ at wave 1 using a
specified sampling scheme. For simplicity, we assume no
non-response here, but return to this possibility in Section 4.

In order to formulate a point estimator of $\beta$, we extend
the specification of (1) to the following ‘working’ model:


$$y_{it} = x_{it} \beta + u_i + v_{it}, \qquad (2)$$


where $u_i$ and $v_{it}$ are independent random effects with zero
means and variances $\sigma_u^2 = \rho \sigma^2$ and $\sigma_v^2 = (1 - \rho) \sigma^2$
respectively, conditional on $(x_1, \ldots, x_N)$. This model may be
called a uniform correlation model (Diggle et al. 2002, page
55) or a two-level model (Goldstein 2003). The parameter
$\rho$ is the intra-individual correlation.

The basic point estimator of $\beta$ we consider is


$$\hat{\beta} = \left[ \sum_{i \in s} w_i x_i' \hat{V}^{-1} x_i \right]^{-1} \sum_{i \in s} w_i x_i' \hat{V}^{-1} y_i, \qquad (3)$$


where $w_i$ is a survey weight and $\hat{V}$ is a $T \times T$ estimated
covariance matrix of $y_i$ under the working model (2), i.e., it
has diagonal elements $\hat{\sigma}^2$ and off-diagonal elements $\hat{\rho} \hat{\sigma}^2$,
where $(\hat{\rho}, \hat{\sigma}^2)$ is an estimator of $(\rho, \sigma^2)$. (Note that in fact
$\hat{\sigma}^2$ cancels out in (3) and hence $\sigma^2$ does not need to be
estimated for $\hat{\beta}$.) In the absence of the weight terms and
survey considerations, the form of $\hat{\beta}$ is motivated by the
generalized estimating equations (GEE) approach of Liang
and Zeger (1986). The idea here is that $\hat{\beta}$, as a generalized
least squares estimator of $\beta$, would be fully efficient if the
working model (2) held. However, $\hat{\beta}$ remains consistent
under (1) and may still be expected to combine within- and
between-individual information in a reasonably efficient
way even if the working model for the error structure does
not hold exactly.
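
The estimator in (3) can be sketched numerically as follows. This is only a minimal illustration under simplifying assumptions, not the SUDAAN or IGLS implementation: it assumes a balanced panel held in arrays, it takes the working correlation as given rather than estimating it, and the function names `exchangeable_cov` and `weighted_gls` are hypothetical.

```python
import numpy as np

def exchangeable_cov(T, rho, sigma2=1.0):
    """T x T working covariance: sigma2 on the diagonal, rho * sigma2 elsewhere."""
    return sigma2 * ((1.0 - rho) * np.eye(T) + rho * np.ones((T, T)))

def weighted_gls(X, Y, w, rho, sigma2=1.0):
    """Weighted GLS estimator of the form (3).

    X: (n, T, q) covariates, Y: (n, T) outcomes, w: (n,) survey weights.
    rho is a given (not estimated) working intra-individual correlation.
    """
    T = X.shape[1]
    V_inv = np.linalg.inv(exchangeable_cov(T, rho, sigma2))
    # sum_i w_i x_i' V^{-1} x_i, a q x q matrix
    A = np.einsum("i,itq,ts,isr->qr", w, X, V_inv, X)
    # sum_i w_i x_i' V^{-1} y_i, a vector of length q
    b = np.einsum("i,itq,ts,is->q", w, X, V_inv, Y)
    return np.linalg.solve(A, b)
```

Because the factor $\sigma^2$ enters both bracketed sums in (3) in the same way, the value passed as `sigma2` should not affect the result, which is the cancellation noted in the text.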

The survey weights are included in (3) following the
pseudo-likelihood approach (Skinner 1989b) to ensure that
$\hat{\beta}$ is approximately unbiased for $\beta$ with respect to the
model and the design, provided (1) holds.

There are a number of alternative ways of estimating $\rho$.
In a non-survey setting, Liang and Zeger (1986) provide an
iterative approach which alternates between estimates of $\beta$
and $\rho$. Shah, Barnwell and Bieler (1997) describe how
survey weights may be incorporated into this approach and
implement this method in the REGRESS procedure of the
software SUDAAN. By default, SUDAAN implements
only one step of this iterative method and, in the non-survey
setting, Lipsitz, Fitzmaurice, Orav and Laird (1994)
conclude there is little to be lost by using only a single step.

For the working model in (2), the approach of Liang and
Zeger (1986) to the estimation of $\beta$ and $\rho$ is virtually
identical to the iterative generalized least squares (IGLS)
estimation approach of Goldstein (1986). Both methods
iterate between estimates of $\beta$ and $\rho$ and both use GLS to
estimate $\beta$ given the current estimate of $\rho$. The only slight
difference is in the method used to estimate $\rho$. Pfeffermann,
Skinner, Holmes, Goldstein and Rasbash (1998) show how
to incorporate survey weights into the IGLS approach and
their method may be expected to lead to very similar
estimates of $\rho$ to those in the SUDAAN REGRESS
procedure. For the purposes of this paper, the precise form
of $\hat{\rho}$ will not be critical and we may view $\hat{\beta}$ as either a
weighted GEE or a weighted IGLS estimator.

We now turn to the estimation of the covariance matrix
of $\hat{\beta}$ under the complex sampling scheme. We shall generally
assume that a stratified multistage sampling scheme
has been employed. We consider two main approaches to
variance estimation.

Our first approach is the classical method of linearization
(Skinner 1989b, page 78). The estimator of the covariance
matrix of $\hat{\beta}$ is


$$v(\hat{\beta}) = \left[ \sum_{i \in s} w_i x_i' \hat{V}^{-1} x_i \right]^{-1} \left[ \sum_h \frac{n_h}{n_h - 1} \sum_a (z_{ha} - \bar{z}_h)(z_{ha} - \bar{z}_h)' \right] \left[ \sum_{i \in s} w_i x_i' \hat{V}^{-1} x_i \right]^{-1}, \qquad (4)$$


where $h$ denotes stratum, $a$ denotes primary sampling unit
(PSU), $n_h$ is the number of PSUs in stratum $h$,
$z_{ha} = \sum_i w_i x_i' \hat{V}^{-1} e_i$ (the sum being over the sampled
individuals $i$ in PSU $a$ of stratum $h$), $\bar{z}_h = \sum_a z_{ha} / n_h$
and $e_i = y_i - x_i \hat{\beta}$. Similar
estimators are considered by Shah et al. (1997, pages 8-9)
and Lavange et al. (2001). If the weights, the sampling
scheme and the difference between $n/(n-1)$ and 1 are
ignored, this estimator reduces to the ‘robust’ variance
estimator presented by Liang and Zeger (1986).
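
The structure of the linearization estimator in (4) may be easier to follow in code. The sketch below is an illustration under stated assumptions rather than the procedure used by any particular package: the panel is balanced, the working correlation and the point estimate are supplied by the caller, PSU labels are unique within strata, each stratum contains at least two PSUs, and the name `linearization_variance` is hypothetical.

```python
import numpy as np

def linearization_variance(X, Y, w, rho, beta_hat, stratum, psu):
    """Sandwich estimator of the form (4).

    X: (n, T, q), Y: (n, T), w: (n,), beta_hat: (q,).
    stratum, psu: (n,) integer labels for each sampled individual.
    """
    n, T, q = X.shape
    # Inverse working covariance (sigma^2 set to 1, since it cancels)
    V_inv = np.linalg.inv((1.0 - rho) * np.eye(T) + rho * np.ones((T, T)))
    A = np.einsum("i,itq,ts,isr->qr", w, X, V_inv, X)
    resid = Y - X @ beta_hat  # e_i = y_i - x_i beta_hat, shape (n, T)
    # Individual contributions w_i x_i' V^{-1} e_i, shape (n, q)
    z = np.einsum("i,itq,ts,is->iq", w, X, V_inv, resid)
    middle = np.zeros((q, q))
    for h in np.unique(stratum):
        in_h = stratum == h
        psus = np.unique(psu[in_h])
        n_h = len(psus)
        # PSU totals z_ha within stratum h, shape (n_h, q)
        z_ha = np.array([z[in_h & (psu == a)].sum(axis=0) for a in psus])
        d = z_ha - z_ha.mean(axis=0)
        middle += n_h / (n_h - 1) * d.T @ d
    A_inv = np.linalg.inv(A)
    return A_inv @ middle @ A_inv
```

Passing a single stratum and giving every individual its own PSU label yields the version of the estimator that ignores the design.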

Our second approach is more directly model-based. The 
model is first extended to represent the complex population 
underlying the sampling scheme and inference then takes 
place with respect to the extended model. We consider only 
the case of two-stage sampling from a clustered population, 
where the two-level model in (2) is extended to the three- 
level model (Goldstein 2003): 


$$y_{ait} = x_{ait} \beta + \eta_a + u_{ai} + v_{ait}. \qquad (5)$$


The additional subscript $a$ denotes cluster and the
additional random term $\eta_a$ with variance $\sigma_\eta^2$ represents the
cluster effect (assumed independent of $u_{ai}$ and $v_{ait}$). We let
$\sigma_u^2$ and $\sigma_v^2$ denote the variances of $u_{ai}$ and $v_{ait}$
respectively. Inference then takes place using IGLS, which may be
weighted to avoid selection bias. This approach generates an
estimated covariance matrix of the estimator of $\beta$ directly.
It should be noted, however, that the estimator of $\beta$ derived
using weighted IGLS under model (5) may differ slightly 
from the estimator in (3) (although, for given estimates of 
the three variance components in (5), it will be the same as a 
weighted GEE estimator with a working covariance matrix 
based on this three-level model). Nevertheless, from our 
experience of social survey applications, such as in Section 
4, and from theory (Scott and Holt 1982) the difference 
between these alternative point estimators will often be 
negligible. 

Two broad approaches to deriving variance estimators 
from (5) are available. First, ignoring survey weights, the 
standard IGLS method (Goldstein 1986) may be employed, 
assuming that each random effect follows a normal distribution.
Second, to avoid the assumption of normal
homoscedastic random effects, a ‘robust’ variance estimation
method (Goldstein 2003, page 80) may be
employed. This approach is extended to handle survey
weights in Pfeffermann et al. (1998). Leaving aside
stratification, their variance estimator is identical to the
linearization estimator in (4) for a given value of $\rho$.


3. Properties of variance estimators 


In this section we consider the properties of the
estimators of the covariance matrix of $\hat{\beta}$ described in the
previous section. We focus first on the linearization
estimator $v(\hat{\beta})$ in (4).

The consistency of $v(\hat{\beta})$ for the covariance matrix of $\hat{\beta}$
follows established arguments in a suitable asymptotic
framework (e.g., Fuller 1975; Binder 1983). The one non-standard
feature is the presence of $\hat{V}^{-1}$ in $\hat{\beta}$ and $v(\hat{\beta})$ and
the dependence of $\hat{V}$ on $\hat{\rho}$. In fact, in large samples the
covariance matrix of $\hat{\beta}$ depends on $\hat{\rho}$ only via its limiting
value $\rho^*$ (in a given asymptotic framework). To see this,
write $\hat{\beta} - \beta = (\sum_{i \in s} \hat{u}_i)^{-1} \sum_{i \in s} \hat{z}_i$, where
$\hat{u}_i = w_i x_i' \hat{V}^{-1} x_i$, $\hat{z}_i = w_i x_i' \hat{V}^{-1} \varepsilon_i$
and $\varepsilon_i = y_i - x_i \beta$. Note that, under weak
regularity conditions (Fuller and Battese 1973, Corollary 3),
the asymptotic distribution of $\hat{\beta} - \beta$ is the same as that of
$\tilde{\beta} - \beta = (\sum_{i \in s} u_i^*)^{-1} \sum_{i \in s} z_i^*$, where
$u_i^* = w_i x_i' V^{*-1} x_i$, $z_i^* = w_i x_i' V^{*-1} \varepsilon_i$
and $V^*$ takes the same form as $\hat{V}$ with $\hat{\rho}$
replaced by $\rho^* = \mathrm{plim}(\hat{\rho})$, the probability limit of $\hat{\rho}$ in the
asymptotic framework. Writing $\bar{z}^* = \sum_{i \in s} z_i^* / n$ and
$U = \mathrm{plim}(\sum_{i \in s} u_i^* / n)$, we may thus approximate the covariance
matrix of $\hat{\beta}$ asymptotically by $\mathrm{var}(\hat{\beta}) \approx U^{-1} \mathrm{var}(\bar{z}^*) U^{-1}$. If
the working model (2) holds then $\rho^* = \rho$ and this
covariance matrix will be the same for any consistent
method of estimating $\rho$. Even if the working model does
not hold, $v(\hat{\beta})$ will be consistent for $U^{-1} \mathrm{var}(\bar{z}^*) U^{-1}$
within the kinds of asymptotic frameworks considered by
Fuller (1975) and Binder (1983) and under the kinds of
regularity conditions they and Fuller and Battese (1973) set
out.

We next explore the impact on the linearization method
of ignoring a complex sampling design. We denote by
$v_0(\hat{\beta})$ the linearization estimator obtained from expression
(4) by ignoring the design, i.e., by assuming only a single
stratum with PSUs identical to individuals so that $n_h = n$ is
the overall sample size and $z_{ha}$ is replaced by
$z_i = w_i x_i' \hat{V}^{-1} e_i$. We shall be concerned with the bias of $v_0(\hat{\beta})$
when in fact the design is complex. Let $\hat{\beta}_k$ denote the $k$th
element of $\hat{\beta}$ and let $v_0(\hat{\beta}_k)$ denote the $k$th diagonal element of
$v_0(\hat{\beta})$. Then, following Skinner (1989a, page 24), we shall
measure the relative bias of the ‘incorrectly specified’
variance estimator $v_0(\hat{\beta}_k)$ as an estimator of $\mathrm{var}(\hat{\beta}_k)$ by
the misspecification effect,
$\mathrm{meff}[\hat{\beta}_k, v_0(\hat{\beta}_k)] = \mathrm{var}(\hat{\beta}_k) / E[v_0(\hat{\beta}_k)]$.
Since $v(\hat{\beta}_k)$ is a consistent estimator of $\mathrm{var}(\hat{\beta}_k)$,
$\mathrm{meff}[\hat{\beta}_k, v_0(\hat{\beta}_k)]$ may be estimated by $v(\hat{\beta}_k) / v_0(\hat{\beta}_k)$ and
is closely related to the idea of design effect.

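To investigate the nature of $\mathrm{meff}[\hat{\beta}_k, v_0(\hat{\beta}_k)]$, we first
write:

$$v_0(\hat{\beta}) = \left( \sum_{i \in s} \hat{u}_i \right)^{-1} \left[ \frac{n}{n - 1} \sum_{i \in s} (z_i - \bar{z})(z_i - \bar{z})' \right] \left( \sum_{i \in s} \hat{u}_i \right)^{-1}, \qquad (6)$$

where $\bar{z} = \sum_{i \in s} z_i / n$. Then, as an asymptotic approximation,
we have $E[v_0(\hat{\beta})] \approx U^{-1} [n^{-1} S_z] U^{-1}$, where $S_z$ is the
probability limit of the finite population covariance matrix
of $z_i^*$. Using the fact that the numerator of
$\mathrm{meff}[\hat{\beta}_k, v_0(\hat{\beta}_k)]$ may be approximated by
$U^{-1} \mathrm{var}(\bar{z}^*) U^{-1}$, we can thus write:


$$\mathrm{meff}[\hat{\beta}_k, v_0(\hat{\beta}_k)] = \frac{(U^{-1})_k \, \mathrm{var}(\bar{z}^*) \, (U^{-1})_k'}{(U^{-1})_k \, [n^{-1} S_z] \, (U^{-1})_k'}, \qquad (7)$$


where $(U^{-1})_k$ is the $k$th row of $U^{-1}$. This simplifies in the
case $q = 1$ to:


$$\mathrm{meff}[\hat{\beta}, v_0(\hat{\beta})] = \mathrm{var}(\bar{z}^*) / [n^{-1} S_z]. \qquad (8)$$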


We may explore more specific forms of these
expressions under different models and assumptions about
the weights and the sampling scheme. We focus here on the
impact of clustering, assuming equal weights and no
stratification. Consider the three-level model in (5) and, to
simplify matters, suppose that $q = 1$ and $x_{ait} = 1$ and $\beta$ is
the mean of $y_{ait}$. Then, straightforward algebra shows that
the value of $z^*$ for individual $i$ within cluster $a$ is
$[1 + \rho^*(T - 1)]^{-1} \sum_t (\eta_a + u_{ai} + v_{ait})$. Now suppose that
two-stage sampling is employed with a common sample
size $m$ per cluster. Then, evaluating the variance $\mathrm{var}(\bar{z}^*)$
and probability limit $S_z$ in (8) with respect to the model in
(5), we find, in a similar manner to Skinner (1989a, page
38):


$$\mathrm{meff}[\hat{\beta}, v_0(\hat{\beta})] = 1 + (m - 1) \tau, \qquad (9)$$


where $\tau = \sigma_\eta^2 / (\sigma_\eta^2 + \sigma_u^2 + \sigma_v^2 / T)$ is the intracluster
correlation of $z^*$. We see that, under this model, the meff
increases as $T$ increases (provided $\sigma_\eta^2 > 0$ and $\sigma_v^2 > 0$) and thus the
impact of clustering on variance estimation is greater in the
longitudinal case than for the cross-sectional problem
(where $T = 1$).
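
A small numerical sketch may help to show how expression (9) behaves as the number of waves grows. The variance components and the cluster sample size below are hypothetical values chosen only for illustration, and the function names are not taken from the paper.

```python
def intracluster_correlation(var_eta, var_u, var_v, T):
    """tau in (9): intracluster correlation after pooling T waves."""
    return var_eta / (var_eta + var_u + var_v / T)

def meff_mean(m, var_eta, var_u, var_v, T):
    """Misspecification effect (9) with m sampled individuals per cluster."""
    return 1.0 + (m - 1) * intracluster_correlation(var_eta, var_u, var_v, T)

# Hypothetical variance components (assumed, not estimated from any survey).
for T in (1, 3, 5):
    print(T, round(meff_mean(m=5, var_eta=0.05, var_u=0.45, var_v=0.5, T=T), 3))
```

With a positive wave-specific variance, the printed meff should rise with `T`; setting `var_v=0` removes the dependence on `T` altogether.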

This finding depends on the rather strong assumption that
the cluster effects $\eta_a$ are constant over time. In fact, (9) still
holds if we replace $\eta_a$ by a time-varying effect $\eta_{at}$,
provided we replace $\tau$ by
$\tau = \mathrm{var}(\bar{\eta}_a) / [\mathrm{var}(\bar{\eta}_a) + \sigma_u^2 + \sigma_v^2 / T]$,
where $\bar{\eta}_a = \sum_t \eta_{at} / T$. Now, the meff will
increase as $T$ increases if (and only if) $\sigma_u^2 + \sigma_v^2 / T$
decreases faster with $T$ than $\mathrm{var}(\bar{\eta}_a)$. Whether this is the
case will depend on the particular application. However, we
suggest that for many longitudinal surveys of individuals
with area-based clusters (the kind of setting we have in
mind), this condition is plausible. In such applications we
may often expect $\sigma_v^2$ to be large relative to $\sigma_\eta^2$ (i.e., for the
cross-sectional intracluster correlation to be small) in
particular as a result of wave-specific measurement error
and thus for $\sigma_u^2 + \sigma_v^2 / T$ to decrease fairly rapidly as $T$
increases. The socio-economic characteristics of areas may
often be expected to be more stable and only in unusual
situations might we expect measurement error to lead to
much occasion-specific variance in $\eta_{at}$. Thus, we suggest
that the ratio of $\mathrm{var}(\bar{\eta}_a)$ for $T = 5$, say, compared to
$T = 1$ may in such applications usually be expected to be
greater than $(\sigma_u^2 + \sigma_v^2 / 5) / (\sigma_u^2 + \sigma_v^2)$, which will approach
1/5 as $\sigma_u^2 / \sigma_v^2$ approaches 0. We thus suggest that in many
practical circumstances it will be more important to allow
for clustering in longitudinal analyses than in corresponding
cross-sectional analyses. An empirical illustration is
provided in Section 4.
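
The condition above can be explored numerically once some form is assumed for the time-varying cluster effect. The sketch below assumes, purely for illustration, that $\eta_{at}$ follows a stationary first-order autoregressive process with correlation `phi` between adjacent waves; this assumption, the function names and the numerical values are not taken from the paper.

```python
def var_mean_ar1(var_eta, phi, T):
    """Variance of the T-wave average of a stationary AR(1) cluster effect."""
    total = sum(phi ** abs(s - t) for s in range(T) for t in range(T))
    return var_eta * total / T**2

def meff_time_varying(m, var_eta, phi, var_u, var_v, T):
    """Expression (9) with tau based on the averaged cluster effect."""
    v = var_mean_ar1(var_eta, phi, T)
    tau = v / (v + var_u + var_v / T)
    return 1.0 + (m - 1) * tau

# Hypothetical values: stable cluster effects (phi near 1) versus
# cluster effects that are independent from wave to wave (phi = 0).
for phi in (0.9, 0.0):
    print(phi, [round(meff_time_varying(5, 0.05, phi, 0.2, 0.75, T), 3) for T in (1, 5)])
```

Under this assumed process, `phi = 1` reproduces the constant cluster effect of (9), while small values of `phi` let the averaged cluster effect shrink with `T` and can reverse the direction of the change in the meff.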

We now consider the properties of variance estimators 
based upon the three-level model in (5). We consider only 
the approach based upon the assumption of normally 
distributed homoscedastic random effects, ignoring survey 
weights, given the (virtual) equivalence of the ‘robust’ 
multilevel approach and linearization. 

If model (5) is correct and we can indeed ignore survey 
weights then the model-based variance estimator will be 
consistent (Goldstein 1986). However, as discussed in 
Skinner (1989b, page 68) and supported by theory in 
Skinner (1986), the main feature of clustering likely to 
impact on the standard errors of estimated regression coefficients
is the variation in regression coefficients between
clusters. This is not allowed for in model (5).

To see how model (5) may fail to capture the effects of
clustering adequately, consider the cross-sectional case
($T = 1$) where $x$ is scalar. Then, if the three-level model
(5) holds, an approximate expression for the meff of the
variance estimator of $\hat{\beta}$ based upon the two-level model (2)
is:

$$\mathrm{meff} = 1 + (m - 1) \tau \tau_x, \qquad (10)$$


where $\tau = \sigma_\eta^2 / (\sigma_\eta^2 + \sigma_u^2 + \sigma_v^2)$ and $\tau_x$ is the intracluster
correlation for $x$ (Scott and Holt 1982; Skinner 1989b,
page 68). This result extends, in the longitudinal case, to:


$$1 \le \mathrm{meff} \le 1 + (m - 1) \bar{\tau} \tau_{\bar{x}}, \qquad (11)$$


where $\bar{\tau}$ is the long-run ($T = \infty$) version of $\tau$ (see
Appendix) and $\tau_{\bar{x}}$ is an intracluster correlation coefficient
for $\bar{x}_{ai} = \sum_t x_{ait} / T$. The proof of this result and the
simplifying assumptions required are sketched in the
Appendix. The main point is that both $\tau$ and $\tau_x$ will often
be small, in which case $\tau \tau_x$ will be very small and thus meff
may be implausibly close to one with the model-based
variance estimator being subject to downward bias. We
explore this empirically in Section 4. Of course, random
coefficients could be introduced into model (5) and we
consider this also in Section 4. However, given the difficulty
of specifying a correct random coefficient model, this
approach does not seem likely to be very robust.
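
To give a sense of the magnitudes implied by (10), the following sketch evaluates the expression for hypothetical values of the two intracluster correlations. The numbers are illustrative assumptions, not BHPS estimates, and the function name is not from the paper.

```python
def meff_additive_cluster_effect(m, tau, tau_x):
    """Expression (10): meff implied when clustering enters only
    through an additive random effect, for a scalar covariate."""
    return 1.0 + (m - 1) * tau * tau_x

# Hypothetical values: m = 5 individuals per cluster, with small
# intracluster correlations for both the outcome and the covariate.
print(meff_additive_cluster_effect(m=5, tau=0.05, tau_x=0.1))
```

Since the product of two small correlations is much smaller than either one, the implied meff stays very near one, which is the sense in which an additive cluster effect alone leaves little room for clustering to affect the standard errors.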

Our focus in this section has so far been on the potential 
bias (or inconsistency) of variance estimation methods. It is 
also desirable to consider their efficiency. In particular, the 
linearization method may be expected to be less efficient 
than model-based variance estimation if the model is 
correct. The relative importance of efficiency vs. bias may 
be expected to increase as the number of clusters decreases. 
Wolter (1985, Chapter 8) summarises a number of 
simulation studies investigating both the bias and variance 
of the linearization variance estimator and these studies 
suggest that the linearization method performs well even 
with few clusters. Possible degrees of freedom corrections 
to confidence intervals for regression coefficients based 
upon the linearization method with small numbers of 
clusters are discussed by Fuller (1984). A simulation study 
of estimators for multilevel models in Maas and Hox (2004) 
does not suggest that the linearization method performs 
noticeably worse than the model-based approach, in terms 
of the coverage of confidence intervals for coefficients in 
$\beta$, even with as few as 30 clusters.


4. Example: Regression analysis of BHPS 
data on attitudes to gender roles 


We now present an application to BHPS data to illustrate 
some of the theoretical properties discussed in the previous 
section.

Recent decades have witnessed major changes in the
roles of men and women in the family in many countries. 
Social scientists are interested in the relation between 
changing attitudes to gender roles and changes in behaviour, 
such as parenthood and labour force participation (e.g., 
Morgan and Waite 1987; Fan and Marini 2000). A variety 
of forms of statistical analysis are used to provide evidence 
about these relationships. Here, we consider estimating a 
linear model of form (1), with a measure of attitude to 
gender roles as the outcome variable, y, following an 
analysis of Berrington (2002). 

The data come from waves 1, 3, 5, 7 and 9 (collected in 
1991, 1993, 1995, 1997, and 1999 respectively) of the 
BHPS and these waves are coded $t = 1, \ldots, T = 5$ respectively.
Respondents were asked whether they ‘strongly
agreed’, ‘agreed’, ‘neither agreed nor disagreed’, ‘disagreed’
or ‘strongly disagreed’ with a series of statements
concerning the family, women’s roles, and work out of the 
household. Responses were scored from 1 to 5. Factor 
analysis was used to assess which statements could be 
combined into a gender role attitude measure. The attitude 
score, $y_{it}$, considered here is the total score for six selected
statements for woman $i$ at wave $t$. Higher scores signify
more egalitarian gender role attitudes. Berrington (2002) 
provides further discussion of this variable. A more 
sophisticated analysis might include a measurement error 
model for attitudes (e.g., Fan and Marini 2000), with each of 
the five-point responses to the six statements treated as 
ordinal variables. Here, we adopt a simpler approach, 
treating the aggregate score $y_{it}$ and the associated
coefficient vector $\beta$ as scientifically interesting, with the
measurement error included in the error term of the model. 

Covariates for the regression analysis were selected on 
the basis of discussion in Berrington (2002) but reduced in 
number to facilitate a focus on the methodological issues of 
interest. The covariate of primary scientific interest is 
economic activity, which distinguishes in particular between 
women who are at home looking after children (denoted 
‘family care’) and women following other forms of activity 
in relation to the labour market. Variables reflecting age and 
education are also included since these have often been 
found to be strongly related to gender role attitudes (e.g., 
Fan and Marini 2000). All these covariates may change 
values between waves. A year variable (scored 1, 3, ..., 9) is 
also included. This may reflect both historical change and 
the general ageing of the women in the sample. 

The BHPS is a household panel survey of individuals in 
private domiciles in Great Britain (Taylor, Brice, Buck and 
Prentice-Lane 2001). The initial (wave one) sample in 1991 
was selected by a stratified multistage design in which 
households had approximately equal probabilities of 
inclusion. The households were clustered into 250 primary
sampling units (PSUs), consisting of postcode sectors. All
resident members aged 16 or over were selected in sample 
households. All adults selected at wave one were followed 
from wave two onwards and represent the longitudinal 
sample. The survey is subject to attrition and other forms of 
wave non-response. To handle this non-response, we have 
simply replaced s in (3) by the ‘longitudinal sample’ of 
individuals for which observations are available for each of 
$t = 1, \ldots, T$ and have chosen not to apply any survey
weighting since our aim is to study potential misspeci- 
fication effects associated with clustering and we wish to 
avoid confounding these with weighting effects. We also 
ignore the impact of stratification in the numerical work in 
this section (but see Section 5 for some comments on the 
effect of weights and stratification). 

Given the analytic interest in whether women’s primary 
labour market activity is ‘caring for a family’, we define our 
study population as women aged 16-39 in 1991. Thus our 
data consist of the longitudinal sample of women in the 
eligible age range for whom full interview outcomes 
(complete records) were obtained in all five waves, a sample 
of n= 1,340 women. These women are spread fairly evenly 
across 248 postcode sectors. The small average sample size 
of around five per postcode sector combined with the 
relatively low intra-postcode sector correlation for the 
attitude variable of interest leads to relatively small impacts 
of the design, as measured by meffs. Since our aims are 
methodological ones, we have chosen to group the postcode 
sectors into 47 geographically contiguous clusters, to create 
sharper comparisons, less blurred by sampling errors which 
can be appreciable in variance estimation. The meffs in the 
tables we present therefore tend to be greater than they are 
for the actual design. The latter results tend to follow similar 
patterns, although the patterns are less clear-cut as a result of 
sampling error. 

We first estimate meffs for the linearization estimator, as 
discussed at the beginning of Section 3. Using data from just 
the first wave and setting $x_{it} = 1$, the estimated meff for
this cross-sectional mean is given in Table 1 as about 1.5.
This value is plausible since, if we make the usual
approximation of (9) for unequal sample cluster sizes by
replacing $m$ by $\bar{m}$, the average sample size per cluster, we
find that $1 + (\bar{m} - 1)\tau = 1.5$ and $\bar{m} = 1{,}340/47 \approx 29$
imply a value of $\tau$ of about 0.02 and such a small value is
in line with other estimated values of $\tau$ found for attitudinal
variables in British surveys (Lynn and Lievesley 1991,
Appendix D).
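The back-calculation in the preceding paragraph can be checked with a few lines of code. The sketch below only inverts the approximation meff = 1 + (m̄ - 1)τ for τ, using the figures quoted in the text; the function name `implied_tau` is introduced here for illustration and is not part of the original analysis.

```python
def implied_tau(meff: float, n: int, n_clusters: int) -> float:
    """Invert meff = 1 + (m_bar - 1) * tau for tau, with m_bar = n / n_clusters."""
    m_bar = n / n_clusters  # average sample size per cluster
    return (meff - 1.0) / (m_bar - 1.0)


# Figures quoted in the text: meff of about 1.5, n = 1,340 women, 47 clusters.
tau = implied_tau(meff=1.5, n=1340, n_clusters=47)
print(f"average cluster size: {1340 / 47:.1f}, implied tau: {tau:.3f}")
```

The result is sensitive to the rounded meff: using 1.51 instead of 1.5 changes the implied τ only in the third decimal place.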


Table 1 Estimates for longitudinal means

            β̂       s.e.    meffs
Waves       1-9     1-9     1      1-3          1-5          1-7          1-9
            19.83   0.12    1.51   [illegible]  [illegible]  [illegible]  [illegible]

To assess the impact of the longitudinal aspect of the
data, we estimated a series of meffs using data for waves
$1, \ldots, t$ for $t = 2, 3, \ldots, 5$. Although these estimated meffs
are subject to sampling error, there seems clear evidence in 
Table 1 of a tendency for the meff to increase with the 
number of waves. This trend might be anticipated from the 
theoretical discussion in Section 3 if the average level of 
egalitarian attitudes in an area varies less from year to year 
than the attitude scores of individual women. This seems 
plausible since the latter will be affected both by 
measurement error and genuine changes in attitudes, so that 
$\mathrm{var}(\bar{\eta}_{c})$ may be expected to decline more slowly with $T$
than $\mathrm{var}(\bar{u}_{ci} + \bar{v}_{ci})$. We may therefore expect $\tau$, and
consequently the meff, to increase as $T$ increases, as we
observe in Table 1.

We next elaborate the analysis by including indicator 
variables for economic activity as covariates. The resulting 
regression model has an intercept term and four covariates 
representing contrasts between women who are employed 
full-time and women in other categories of economic 
activity. The estimated meffs are presented in Table 2. The 
intercept term is a domain mean and standard theory for a 
meff of a mean in a domain cutting across clusters (Skinner 
1989b, page 60) suggests that it will be somewhat less than 
the meff for the mean in the whole sample, as indeed is 
observed with the meff for the cross-section domain mean 
of 1.13 in Table 2 being less than the value 1.51 in Table 1. 
As before, there is some evidence in Table 2 of a tendency for
the meff to increase, from 1.13 with one wave to 1.50 with 
five waves, albeit with lower values of the meffs than in 
Table 1. The meffs for the contrasts in Table 2 vary in size, 
some greater than and some less than one. These meffs may 
be viewed as a combination of the traditional variance 
inflating effect of clustering in surveys together with the 
variance reducing effect of blocking in an experiment. Such 
variance reduction arises if the domains being contrasted 
share a common cluster effect (of the form $\eta_c$ in model (5))
which tends to cancel out in the contrasts, implying that the 
actual variance of the contrast is lower than the expectation 
of the variance estimator which assumes independence 
between domains. The latter expectation will be inflated by 
common cluster effects. The main feature of these results of 
interest here is that there is again no tendency for the meffs 
to converge to one as the number of waves increases. If 
there is a trend, it is in the opposite direction. For the 
contrast of particular scientific interest, that between women 
who are full-time employed and those who are ‘at home 
caring for a family’, the meff is consistently well below one. 

We next refine the model further by including, as 
additional covariates, age group, year and qualifications. 
The estimated meffs are given in Table 3. The meffs for the 
regression coefficients corresponding to categories of
economic activity again vary, some being above one and
some below one, for the same reasons as for the contrasts 
(which may also be interpreted as regression coefficients) in 
Table 2. There is again some evidence of a tendency for 
these meffs to diverge away from one as the number of 
waves increases. A comparison of Tables 1 and 3 confirms 
the observation of Kish and Frankel (1974) that meffs for 
regression coefficients tend not to be greater than meffs for 
the means of the dependent variable. 


Table 2 Estimates for regression with covariates
defined by economic activity

                    β̂       s.e.    meffs
Waves               1-9     1-9     1      1-3          1-5          1-7          1-9
Intercept           20.58   0.11    1.13   [illegible]  [illegible]  [illegible]  1.50
Contrasts for
  PT employed       -1.03   0.10    0.93   0.91         0.93         1.00         0.89
  Other inactive    -0.80   0.15    0.60   0.96         0.68         0.76         0.81
  FT student         0.41   0.24    [illegible]
  Family care       -2.18   0.10    0.72   0.49         0.58         0.66         0.60
Note: a) intercept is mean for women full-time employed
      b) contrasts are for other categories of economic activity
         relative to full-time employed


Table 3 Estimates for regression coefficients with
additional covariates in model

                    β̂            s.e.    meffs
Waves               1-9          1-9     1            1-3          1-5          1-7    1-9
Intercept           20.20        0.30    0.95         0.87         [illegible]  1.04   1.07
Year, t             -0.04        0.01    [illegible]  [illegible]  0.69         0.59   0.96
Age Group
  16-21              0.00        -
  22-27             -0.71        0.25    1.22         1.37         1.44         1.73   1.64
  28-33             -0.89        0.27    1.38         1.40         1.46         1.68   1.59
  34+               -1.03        0.27    0.94         1.10         1.13         1.26   1.34
Economic Activity
  FT employed        0.00        -
  PT employed       -0.93        0.10    0.97         0.95         0.96         1.06   0.91
  Other inactive    -0.75        0.15    0.60         0.96         0.68         0.77   0.81
  FT student        [illegible]  0.24    [illegible]
  Family care       -2.09        0.10    0.77         0.59         0.70         0.78   0.67
Qualification
  Degree             0.00        -
  QF                -0.52        0.21    0.77         0.64         0.75         0.87   0.85
  A-level           -0.61        0.24    0.98         0.87         0.94         0.94   1.01
  O-level           -0.44        0.20    0.62         0.62         0.59         0.69   0.73
  Other             [illegible]  0.22    0.83         0.83         0.78         0.80   0.82


We next consider model-based standard errors obtained
from the three level model in (5), as discussed in Section 2.
The results are given in Table 4 in the column headed ‘3
level model-based’. For comparison, we also estimate the
standard errors under the two level model in (2); the results
are in the column headed ‘2 level model-based’. The
estimates in the two columns are virtually identical. There is 
a single digit difference in the third decimal place for some 
coefficients and slightly greater difference for the intercept 
term. We suggest that this is evidence that simply adding in 
a random area effect term can seriously understate the 
impact of clustering on the standard errors of the estimated 
regression coefficients. This evidence is in line with the 
theoretical upper bound for the meff in (11). The estimated 
value of $\tau$ in (11) is 0.019 and none of the covariates may
be expected to display important intra-area correlation so the 
expected values of the variance estimators for the two-level 
and three-level models would be expected to be very close. 

We suggested in Section 3 that the main feature of 
clustering likely to impact on the covariance matrix of $\hat{\beta}$ is
the variation in regression coefficients between clusters. We
have explored this idea by introducing random coefficients
in the model. Treating the elements of $\beta$ now as the
expected values of the random coefficients, we found that
the estimates of $\beta$ were hardly changed. We found that the
estimated standard errors of these estimates were indeed 
inflated, much more so than from the introduction of the 
extra cluster random effect in model (5), and that the 
inflation was of an order similar to those of the meffs in 
Tables 2 and 3. Nevertheless, the IGLS method did lead to 
several negative estimates of the variances of the random 
coefficients, raising issues of which coefficients to allow to 
vary or more generally the issue of model specification. 
This problem is accentuated with increasing numbers of 
covariates, as the number of parameters in the covariance 
matrix of the coefficient vector increases with the square of 
the number of covariates. Overall, the inclusion of random 
coefficients seems to raise at least as many problems as it 
solves, if the clustering is not of intrinsic scientific interest, 
and thus does not seem a very satisfactory way to allow for 
clustering in variance estimation. It is simpler to change the 
method of variance estimation. 

As mentioned at the end of Section 2, one alternative is a 
‘robust’ variance estimation method based on the model in 
(5) (Goldstein 2003, page 80). Values of such robust 
standard error estimates are also included in Table 4. As 
anticipated in Section 2, the robust standard error estimator 
for the two level model performs very similarly to the 
linearization estimator which ignores clustering. The robust 
standard error estimator for the three level model performs 
very similarly to the linearization estimator which allows for 
two stage sampling. The slight differences reflect the 
differences between the methods of estimating V.

Table 4 Estimated standard errors of regression coefficients

                    Linearization          Multilevel modelling
                    SRS          complex   2 level       2 level   3 level       3 level
                                           model-based   robust    model-based   robust
Intercept           0.287        0.296     0.253         0.288     0.259         0.293
Year, t             0.014        0.014     0.013         0.014     0.013         0.014
Age Group
  16-21
  22-27             0.191        0.245     0.155         0.192     0.155         0.243
  28-33             0.214        0.270     0.187         0.215     0.187         0.266
  34+               [illegible]  0.275     0.218         0.238     0.218         0.271
Economic Activity
  FT employed
  PT employed       0.103        0.098     0.098         0.103     0.098         0.096
  Other inactive    0.166        0.150     0.146         0.166     0.146         0.148
  FT student        0.207        0.238     0.199         0.207     0.199         0.236
  Family care       0.125        0.102     0.112         0.125     0.112         0.101
Qualification
  Degree
  QF                0.228        0.210     0.207         0.228     0.208         0.211
  A-level           0.238        0.239     0.209         0.240     0.210         0.237
  O-level           0.234        0.199     0.217         0.235     0.218         0.199
  Other             0.247        0.224     0.229         0.249     0.230         0.223


The linearization method in the presence of two-stage 
sampling is thus very close to robust variance estimation 
methods used in the literature on multilevel modeling. The 
distinction between the methods becomes stronger if we 
allow also for stratification and weighting. Another 
distinction is that in the multilevel modeling approach, 
differences between model-based and the robust standard 
errors might be used as a diagnostic tool to detect departures 
from the model (Maas and Hox 2004). For example, the 
large differences in the three-level standard errors for the 
coefficients of age group in Table 4 might lead to 
consideration of the inclusion of random coefficients for age 
group. This contrasts with the survey sampling approach 
where the error structure in model (5) is only treated as a 
working model and it is not necessarily expected that 
standard errors based upon this model will be approximately 
valid. 
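As a rough consistency check between the tables, the sketch below assumes that the estimated meff for waves 1-9 is the squared ratio of the ‘complex’ to the ‘SRS’ linearization standard error. That relationship is our reading of the tables rather than a formula stated in this section, and only rows that are clearly legible in both Table 3 and Table 4 are used.

```python
# (SRS s.e., complex s.e.) for waves 1-9 as printed in Table 4.
TABLE4_SE = {
    "Age 22-27": (0.191, 0.245),
    "Age 28-33": (0.214, 0.270),
    "Other inactive": (0.166, 0.150),
    "QF": (0.228, 0.210),
    "O-level": (0.234, 0.199),
}
# Meffs for waves 1-9 as printed in Table 3.
TABLE3_MEFF = {
    "Age 22-27": 1.64,
    "Age 28-33": 1.59,
    "Other inactive": 0.81,
    "QF": 0.85,
    "O-level": 0.73,
}


def meff_from_se(se_srs: float, se_complex: float) -> float:
    """Assumed relationship: meff = (complex s.e. / SRS s.e.) squared."""
    return (se_complex / se_srs) ** 2


for name, (se_srs, se_complex) in TABLE4_SE.items():
    implied = meff_from_se(se_srs, se_complex)
    print(f"{name:15s} implied {implied:.3f}   printed {TABLE3_MEFF[name]:.2f}")
```

Small discrepancies in the second decimal place are expected because the printed standard errors are rounded to three decimals.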


5. Discussion 


We have presented some theoretical arguments and 
empirical evidence that the impact of ignoring clustering in 
standard error estimation for certain longitudinal analyses 
can tend to be larger than for corresponding cross-sectional 
analyses. The implication is that it is, in general, at least as 
important to allow for clustering in standard errors for 
longitudinal analyses as for cross-sectional analyses and that 
the findings of, for example, Kish and Frankel (1974),
should not be used as grounds to ignore complex sampling
in the former case. 

The longitudinal analyses considered in this paper are of 
a certain kind and we should emphasise that the patterns 
observed for meffs in these kinds of analyses may well not 
extend to all kinds of longitudinal analyses. To speculate 
about the class of models and estimators for which the 
patterns observed in this paper might apply, we conjecture 
that increased meffs for longitudinal analyses will arise 
when the longitudinal design enables temporal ‘random’ 
variation in individual responses to be extracted from 
between-person differences and hence to reduce the 
component of standard errors due to these differences, but 
provides less ‘explanation’ of between cluster differences, 
so that the relative importance of this component of standard 
errors becomes greater. 

The empirical work presented in this paper has also been 
restricted to the impact of clustering. We have undertaken 
corresponding work allowing for weighting and strat- 
ification and found broadly similar findings. Stratification 
tends to have a smaller effect than clustering. The sample 
selection probabilities in the BHPS do not vary greatly and 
the impact of weighting by the reciprocals of these 
probabilities on both point and variance estimates tends not 
to be large. There is rather greater variation among the 
longitudinal weights which are provided with BHPS data 
for analyses of sets of individuals who have responded at 
each wave up to and including a given year $T$. The impact
of these weights on point and variance estimates is somewhat
greater. As $T$ increases and further attrition occurs,
the longitudinal weights tend to become more variable and
lead to greater inflation of variances. This tends to
compound the effect we have described of meffs increasing
with $T$.

Leaving aside consideration of stratification and 
weighting, we have compared two approaches to allowing 
for cluster sampling. We have treated the survey sampling 
approach as a benchmark. We have also considered a 
multilevel modelling approach to allow for clustering. We 
have suggested that the use of a simple additive random 
effect to represent clustering can seriously understate the 
impact of clustering and may lead to underestimation of 
standard errors. If the clustering is of scientific interest, one 
solution would be to consider including random 
coefficients. Another would be to use the ‘GEE2’ approach 
(Liang, Zeger and Qaqish 1992) and specify an additional 
parametric model for $E(y_i y_j)$. If the clustering is treated as
a nuisance, simply reflecting administrative convenience in 
data collection, we suggest the survey sampling approach 
has a number of practical advantages. This is discussed 
further by Lavange et al. (1996, 2001) in relation to other
applications to repeated measures data. 


Appendix 
Justification for (11) 


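For simplicity, $x$ and $\beta$ are taken to be scalar, $\hat{\beta}$ is
taken to be the ordinary least squares estimator and it is
assumed that the sample sizes within clusters are all equal to
$m$. The meff in (11) is defined as $\mathrm{var}_3(\hat{\beta}) / E_3[v_2(\hat{\beta})]$,
where $E_3$ and $\mathrm{var}_3$ are moments with respect to the three-level
model in (5) and $v_2(\hat{\beta})$ is a variance estimator based
upon the two-level model in (2). Under (5) we obtain

$$\mathrm{var}_3(\hat{\beta}) = \Big[\sum_{cit} x_{cit}^2\Big]^{-2} \Big[\sigma_\eta^2 \sum_{c} x_{c++}^2 + \sigma_u^2 \sum_{ci} x_{ci+}^2 + \sigma_v^2 \sum_{cit} x_{cit}^2\Big],$$

where + denotes summation across a suffix, $\sigma_\eta^2$, $\sigma_u^2$ and $\sigma_v^2$
are the respective variances of $\eta_c$, $u_{ci}$ and $v_{cit}$, and $x_{cit}$ is
centred at 0. We further suppose that $v_2(\hat{\beta})$ is defined so
that

$$E_3[v_2(\hat{\beta})] = \Big[\sum_{cit} x_{cit}^2\Big]^{-2} \Big[(\sigma_\eta^2 + \sigma_u^2) \sum_{ci} x_{ci+}^2 + \sigma_v^2 \sum_{cit} x_{cit}^2\Big].$$

After some algebra we may show that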
After some algebra we may show that 


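$$\text{meff} = 1 + (m - 1)\,\tau_z\,\tau\,\rho\,\frac{1 + (T - 1)\tau_x}{1 + (T - 1)\rho\,\tau_x}, \qquad (12)$$

where $\tau = \sigma_\eta^2 / (\sigma_\eta^2 + \sigma_u^2)$,
$\rho = (\sigma_\eta^2 + \sigma_u^2) / (\sigma_\eta^2 + \sigma_u^2 + \sigma_v^2)$,
$\sigma_x^2 = \sum_{cit} x_{cit}^2 / (nT)$,
$\sigma_x^2 \tau_x = [\sum_{ci} (x_{ci+}/T)^2 / n - \sigma_x^2 / T] / [1 - 1/T]$,
$z_{ci} = x_{ci+}$, $\sigma_z^2 = \sum_{ci} z_{ci}^2 / n$,
$\sigma_z^2 \tau_z = [\sum_{c} (z_{c+}/m)^2 / C - \sigma_z^2 / m] / [1 - 1/m]$ and
$n = Cm$ is the sample size. Note that
$\tau\rho = \sigma_\eta^2 / (\sigma_\eta^2 + \sigma_u^2 + \sigma_v^2)$ and that, when
$T = 1$, the final ratio in (12) equals one, so that (12) reduces to (10). In general
$\rho \le 1$ and (11) follows from (12). In fact, we estimate $\rho$ as
0.59 in our application so the bound in (11) is not expected
to be very tight.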
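The algebra leading from the two variance expressions to the closed form can be checked numerically. The sketch below is a check under stated assumptions, not the authors' code: the covariate values and variance components are simulated and hypothetical, and `tau_z` and `tau_x` are computed as the ratios implied by the sums of squares of $x_{c++}$, $x_{ci+}$ and $x_{cit}$, which is our reading of the definitions in this appendix.

```python
import random


def design_sums(x):
    """x[c][i][t] -> (sum_c x_{c++}^2, sum_ci x_{ci+}^2, sum_cit x_{cit}^2)."""
    a = sum(sum(sum(person) for person in cluster) ** 2 for cluster in x)
    b = sum(sum(person) ** 2 for cluster in x for person in cluster)
    d = sum(v * v for cluster in x for person in cluster for v in person)
    return a, b, d


def meff_direct(x, var_eta, var_u, var_v):
    """Ratio of the three-level variance to the expected two-level variance estimator."""
    a, b, d = design_sums(x)
    return (var_eta * a + var_u * b + var_v * d) / ((var_eta + var_u) * b + var_v * d)


def meff_closed_form(x, var_eta, var_u, var_v):
    """Closed form 1 + (m-1) tau_z tau rho [1+(T-1)tau_x] / [1+(T-1) rho tau_x]."""
    a, b, d = design_sums(x)
    m, n_waves = len(x[0]), len(x[0][0])
    tau_z = (a / b - 1.0) / (m - 1)        # clustering of person totals within clusters
    tau_x = (b / d - 1.0) / (n_waves - 1)  # persistence of x within persons over waves
    tau = var_eta / (var_eta + var_u)
    rho = (var_eta + var_u) / (var_eta + var_u + var_v)
    ratio = (1 + (n_waves - 1) * tau_x) / (1 + (n_waves - 1) * rho * tau_x)
    return 1.0 + (m - 1) * tau_z * tau * rho * ratio


random.seed(1)
C, m, T = 6, 4, 5  # hypothetical numbers of clusters, persons per cluster, waves
x = [[[random.gauss(0, 1) for _ in range(T)] for _ in range(m)] for _ in range(C)]
direct = meff_direct(x, 0.2, 1.0, 0.8)
closed = meff_closed_form(x, 0.2, 1.0, 0.8)
print(f"direct: {direct:.6f}  closed form: {closed:.6f}")
```

The two functions agree to rounding error for any balanced array `x`, which is the content of the "after some algebra" step.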
Modelling durations of multiple spells from longitudinal survey data


Milorad S. Kovačević and Georgia Roberts ¹


Abstract 


We investigate some modifications of the classical single-spell Cox model in order to handle multiple spells from the same 
individual when the data are collected in a longitudinal survey based on a complex sample design. One modification is the 
use of a design-based approach for the estimation of the model coefficients and their variances; in the variance estimation 
each individual is treated as a cluster of spells, bringing an extra stage of clustering into the survey design. Other 
modifications to the model allow a flexible specification of the baseline hazard to account for possibly differential 
dependence of hazard on the order and duration of successive spells, and also allow for differential effects of the covariates 
on the spells of different orders. These approaches are illustrated using data from the Canadian Survey of Labour and
Income Dynamics (SLID).


Key Words: Cox regression; Design-based inference; Model-based inference; Spell order; SLID. 


1. Introduction 


The modelling problem addressed in this paper is known 
under different names such as correlated failure-time 
modelling, multivariate survival modelling, multiple spells 
modelling, or a recurrent events problem. It has been studied 
in the biomedical (e.g., Lin 1994, Hougaard 1999), social 
(Blossfeld and Hamerle 1989, Hamerle 1989) and economic 
literature (Lancaster 1979, Heckman and Singer 1982). 
Generally this type of modelling is required to address 
issues that arise in time-to-event studies when two or more 
events occur to the same subject and where the research 
interest is to assess the effect of various covariates on the 
length of a spell. Failure times are correlated within a 
subject, and thus the assumption of independence of failure 
times conditional on given measured covariates, required by 
standard survival models, is likely to be violated. In studies 
of duration of spells (poverty, unemployment, etc.), the
“failure” is equivalent to exiting from the state of interest. 
An additional property of many multiple spells, often 
ignored, is that the spells are ordered “events”; that is, the 
second spell cannot occur before the first, etc. This paper 
was motivated by a study of unemployment spells, 
discussed further in Section 5. 

The dependence among the spells from the same 
individual arises from the fact that these spells share certain 
unobserved characteristics of the individual. The effect of 
these unobserved characteristics can be explicitly modelled 
as a random effect (e.g., Clayton and Cuzick 1985). When 
this is done, it is assumed that the random effect follows a 
known statistical distribution. The gamma distribution with 
mean 1 and unknown variance is the distribution of choice
in many applications. Then, estimates of random and fixed
effects can be obtained by some suitable method (e.g.,
two-stage likelihood (Lancaster 1979), using an EM 
algorithm (Klein 1992), etc.). This paper does not explore 
this approach. 

Another approach, and the one that we use, is
semi-parametric: we do not explicitly model the dependence among
multiple spells. Instead, we model the marginal distributions of the
individual spells, with a possible utilization of the order of
the spells in the model specification. In the non-survey
context, Lin (1994) describes how it is sufficient just to 
modify the “naive” covariance matrix of the estimated 
model coefficients obtained under the assumption of inde- 
pendence since the correlated durations need to be 
accounted for in the variance estimates but not in the 
estimates of coefficients per se. 

In socio-economic studies of spell durations the data 
sources are frequently longitudinal surveys with complex 
sample designs that involve stratification, sampling in sever- 
al stages, selection with unequal probabilities, stochastic 
adjustments for attrition and non-response, calibration to 
known parameters, etc. Consequently, it is necessary to
account for the impact of the sample design on the 
distribution of the sample data when estimating model 
parameters and the variances of these estimates. Our 
approach when analyzing complex survey data is to model 
the marginal distributions of the multiple spells using single- 
spell methods, treating the dependence among the spells as a 
nuisance - both the dependence due to the correlation of 
spells from the same person and dependence among 
individuals due to the survey design - but taking account of 
the unequal selection probabilities through the survey 
weights. Based on the model chosen, finite population
parameters are defined and estimated as in Binder (1992).
Standard errors are estimated using an appropriate design- 
consistent linearization method under the assumption that 
the primary sampling units are sampled with replacement 
within strata. This assumption is viable when the sampling 
rates at the first stage are small, as is generally the case in 
socio-economic surveys. Also, for such samples, the 
difference between finite population and superpopulation 
inference (i.e., the standard errors and the test statistics) has 
been found to be rather negligible (Lin 2000). Therefore, the 
results from inference based on our approach extend beyond 
the finite population under study. 

In the next section we review single-spell modelling and 
some methods for robust estimation of variances when the 
model is misspecified - first under a model-based frame- 
work and then under a design-based one. Section 3 contains 
further discussion of robust variance estimation for multiple 
spells. In Section 4, we introduce three models for multiple 
spells and describe how to fit these models using design- 
based robust estimation methods. In Section 5, we fit these 
models to data from the Canadian Survey of Labour and 
Income Dynamics (SLID) and discuss the results. Finally, 
Section 6 contains some overall remarks. 


2. Inference for the single-spell hazard rate model 


The duration of a spell (or simply, a spell) experienced 
by an individual is a random variable denoted by $T$. We are
particularly interested in the hazard function $h(t)$ of $T$ at time
$t$, defined as the instantaneous rate of spell completion at
time $t$ given that it has not been completed prior to time $t$,
formally

$$h(t) = \lim_{dt \to 0} \frac{P(t \le T < t + dt \mid T \ge t)}{dt}.$$


The value of the hazard function at $t$ is called the exit rate to
emphasize that the completion of the spell is equivalent to 
exiting the state of interest. Duration models and analysis of 
duration in general are formulated and discussed in terms of 
the hazard function and its properties. 

From a subject matter perspective, frequently the main 
interest is to study the impact of some key covariates on the 
distribution of $T$. A proportional hazards model is often
chosen for such a study. Under the proportional hazards
model, the hazard function of the spell $T$ given a vector of
possibly time-varying covariates $x(t) = (x_1(t), \ldots, x_p(t))'$ is

$$h(t \mid x(t)) = \lambda_0(t)\, e^{x'(t)\beta}. \qquad (1)$$

The function $\lambda_0(t)$ is an unspecified baseline hazard
function and gives the shape of $h(t \mid x(t))$. The baseline
hazard describes the duration dependence, such as whether
the hazard rate depends on time already spent in the spell.
For example, negative dependence describes the situation 
where the longer the spell the smaller the probability of exit. 
If an individual has all x(t) variables set at 0, the value 
(level) of the hazard function is equal to the baseline hazard. 
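To make the roles of the baseline hazard and the covariates concrete, the following sketch evaluates a proportional hazards model with time-fixed covariates. The Weibull form of the baseline, its parameters and the coefficient values are all hypothetical choices made for illustration; the model in the text leaves the baseline hazard unspecified.

```python
import math


def baseline_hazard(t: float, shape: float = 0.7, scale: float = 1.0) -> float:
    """Hypothetical Weibull baseline; shape < 1 gives negative duration dependence."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def hazard(t: float, x: list, beta: list) -> float:
    """Proportional hazards: baseline hazard times exp(x'beta), x fixed over time."""
    linear_predictor = sum(xk * bk for xk, bk in zip(x, beta))
    return baseline_hazard(t) * math.exp(linear_predictor)


beta = [0.5, -0.3]  # illustrative coefficients, not estimates from any data
ratios = [hazard(t, [1, 0], beta) / hazard(t, [0, 0], beta) for t in (0.5, 1.0, 4.0)]
print(ratios)  # the hazard ratio does not depend on t
```

With all covariates set to 0 the hazard equals the baseline, and the ratio of hazards for two covariate patterns is constant over the duration of the spell, which is what "proportional" refers to.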


2.1 Model-based inference 


The vector $\beta$ contains the unknown regression
parameters showing the dependence of the hazard on the
$x(t)$ vector, and may be estimated by maximizing the partial
likelihood function (Cox 1975):

$$L(\beta) = \prod_{i=1}^{n} \left\{ \frac{e^{x_i'(T_i)\beta}}{\sum_{j=1}^{n} Y_j(T_i)\, e^{x_j'(T_i)\beta}} \right\}^{\delta_i}. \qquad (2)$$

Here $T_1, \ldots, T_n$ are $n$ possibly right-censored durations;
$\delta_i = 1$ if $T_i$ is an observed duration and $\delta_i = 0$ otherwise;
and $x_i(t)$ is the corresponding covariate vector observed on
$[0, T_i]$. The denominator sum is taken over the spells that
are at risk of being completed at time $T_i$, i.e., $Y_j(t) = 1$
if $t \le T_j$, and is equal to 0 otherwise. The estimate $\hat{\beta}$ of
the model parameter $\beta$ is obtained by solving the partial
likelihood score equation


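$$U(\beta) = \sum_{i=1}^{n} u_i(T_i, \beta) = 0, \qquad (3)$$

where

$$u_i(T_i, \beta) = \delta_i \left\{ x_i(T_i) - \frac{S^{(1)}(T_i, \beta)}{S^{(0)}(T_i, \beta)} \right\}, \qquad (4)$$

$$S^{(0)}(t, \beta) = \frac{1}{n} \sum_{i=1}^{n} Y_i(t)\, e^{x_i'(t)\beta}, \qquad (5)$$

and

$$S^{(1)}(t, \beta) = \frac{1}{n} \sum_{i=1}^{n} Y_i(t)\, x_i(t)\, e^{x_i'(t)\beta}. \qquad (6)$$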
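The score equation can be illustrated for the simplest case of a single time-fixed covariate and no tied durations. The sketch below is a minimal illustration under those assumptions: the toy data are invented, the function names are ours, and bisection stands in for the Newton-type iterations that survival software normally uses.

```python
import math


def cox_score(times, events, x, beta):
    """Partial likelihood score for one time-fixed covariate (no tied durations)."""
    n = len(times)
    score = 0.0
    for i in range(n):
        if not events[i]:  # censored spells have delta_i = 0 and contribute nothing
            continue
        risk = [j for j in range(n) if times[j] >= times[i]]  # spells still at risk at T_i
        s0 = sum(math.exp(x[j] * beta) for j in risk) / n
        s1 = sum(x[j] * math.exp(x[j] * beta) for j in risk) / n
        score += x[i] - s1 / s0
    return score


def solve_score(times, events, x, lo=-5.0, hi=5.0, iters=80):
    """Bisection on the score, which is non-increasing in beta for a scalar covariate.

    Assumes the score is positive at lo and negative at hi.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cox_score(times, events, x, mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Invented toy data: five spells, the third one right-censored.
times = [2.0, 3.0, 5.0, 7.0, 11.0]
events = [1, 1, 0, 1, 1]
x = [1.0, 0.0, 1.0, 0.0, 1.0]
beta_hat = solve_score(times, events, x)
print(f"beta_hat = {beta_hat:.4f}")
```

For these toy data the score equation reduces to a quadratic in exp(beta), so the numerical root can be compared with the analytic one.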


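If model (1) is true and the durations are independent, the
model-based variance matrix of the score function $U(\beta)$ is

$$J(\beta) = -\partial U(\beta) / \partial \beta = \sum_{i=1}^{n} \delta_i \left\{ \frac{S^{(2)}(T_i, \beta)}{S^{(0)}(T_i, \beta)} - \frac{S^{(1)}(T_i, \beta)\, S^{(1)}(T_i, \beta)'}{[S^{(0)}(T_i, \beta)]^2} \right\},$$

where

$$S^{(2)}(t, \beta) = \frac{1}{n} \sum_{i=1}^{n} Y_i(t)\, x_i(t)\, x_i'(t)\, e^{x_i'(t)\beta}.$$

The approximate variance of $\hat{\beta}$, obtained by linearization,
is $J^{-1}(\hat{\beta})$.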


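If the form of (1) is incorrect but observations are
independent, Lin and Wei (1989) provide the robust
variance estimator for $\hat{\beta}$ as

$$J^{-1}(\hat{\beta})\, G(\hat{\beta})\, J^{-1}(\hat{\beta}), \qquad (7)$$

where

$$G(\beta) = \sum_{i=1}^{n} g_i(\beta)\, g_i'(\beta)$$

and

$$g_i(\beta) = u_i(T_i, \beta) - \sum_{j=1}^{n} \frac{\delta_j\, Y_i(T_j)\, e^{x_i'(T_j)\beta}}{n\, S^{(0)}(T_j, \beta)} \left\{ x_i(T_j) - \frac{S^{(1)}(T_j, \beta)}{S^{(0)}(T_j, \beta)} \right\}.$$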


2.2 Design-based inference 


For observations from a survey with a complex sample 
design, Binder (1992) used a pseudo-likelihood method to 
estimate the parameters and their variances for a pro- 
portional hazards model in the case of a single spell per 
individual. In particular, he first defined the finite population
parameter of interest as a solution of the partial likelihood
score equation (3) calculated from the spells of the finite
population targeted by the survey:

$$U_0(\beta) = \sum_{i=1}^{N} u_{i0}(T_i, \beta) = 0,$$

where $u_{i0}(T_i, \beta)$ is the score residual defined in the same
way as $u_i(T_i, \beta)$, except that the averages in the definitions
of $S^{(0)}(t, \beta)$ and $S^{(1)}(t, \beta)$ extend over $N$ observations
rather than $n$. Note that if not all members of the finite
population targeted by the survey experience spells,
$N$ represents the size of the subpopulation that experiences
spells, and the summation is over these $N$ individuals.

An estimate $\hat{\beta}$ of the parameter $\beta$ is obtained as a
solution to the partial pseudo-score estimating equation


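$$\hat{U}(\beta) = \sum_{i=1}^{N} w_i(s)\, \hat{u}_{i0}(T_i, \beta) = 0, \qquad (8)$$

where $w_i(s) = w_i$, the survey weight, if $i \in s$, and 0
otherwise. Function $\hat{u}_{i0}(T_i, \beta)$ takes the form

$$\hat{u}_{i0}(T_i, \beta) = \delta_i \left\{ x_i(T_i) - \frac{\hat{S}^{(1)}(T_i, \beta)}{\hat{S}^{(0)}(T_i, \beta)} \right\},$$

where

$$\hat{S}^{(0)}(t, \beta) = \sum_{i=1}^{N} w_i(s)\, Y_i(t)\, e^{x_i'(t)\beta}$$

and

$$\hat{S}^{(1)}(t, \beta) = \sum_{i=1}^{N} w_i(s)\, Y_i(t)\, x_i(t)\, e^{x_i'(t)\beta}.$$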


Generally, the design-based variance of an estimate $\hat{\theta}$
that satisfies an estimating equation of the form
$\hat{U}(\theta) = \sum w_i u_i(\theta) = 0$ can be estimated, using linearization, as

$$\hat{J}^{-1}\, \hat{V}(\hat{U}(\hat{\theta}))\, \hat{J}^{-1}, \qquad (9)$$

where $\hat{J} = \partial \hat{U}(\theta) / \partial \theta$ is evaluated at $\theta = \hat{\theta}$, and $\hat{V}(\hat{U}(\hat{\theta}))$
is the estimated variance of the estimated total $\hat{U}(\theta)$
obtained by some standard design-based variance estimation
method (see for example Cochran (1977)) and evaluated at
$\theta = \hat{\theta}$. Binder (1983) states that in order to use this approach
to derive a consistent estimate of the variance, $\hat{U}(\theta)$ must
be expressed as a sum of independent random vectors. In
the case of the proportional hazards model above, $\hat{U}(\beta)$
does not satisfy this condition since each $\hat{u}_{i0}$ is a function of
$\hat{S}^{(0)}(T_i, \beta)$ and $\hat{S}^{(1)}(T_i, \beta)$, both of which include many
individuals besides the $i$-th one. Thus, Binder (1992) found
an alternative expression for $\hat{U}(\beta)$ which conforms to
these conditions, making it possible to obtain a design
consistent estimate $\hat{V}(\hat{U}(\hat{\beta}))$ by application of a design-based
variance estimation method to the alternate expression
and then evaluating this variance estimate at $\beta = \hat{\beta}$. If the
design-based variance estimation method chosen is the
linearization method, then the first step consists of
calculating the following residual for each of the sampled
individuals:


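$$\tilde{u}_i(T_i, \hat{\beta}) = \hat{u}_{i0}(T_i, \hat{\beta}) - \sum_{j=1}^{N} \frac{w_j(s)\, \delta_j\, Y_i(T_j)\, e^{x_i'(T_j)\hat{\beta}}}{\hat{S}^{(0)}(T_j, \hat{\beta})} \left\{ x_i(T_j) - \frac{\hat{S}^{(1)}(T_j, \hat{\beta})}{\hat{S}^{(0)}(T_j, \hat{\beta})} \right\}. \qquad (10)$$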


Each individual in the sample belongs to a particular PSU
within a particular stratum. Thus, instead of identifying an
individual by a single subscript $i$ we will use a triple
subscript $hci$ where $h = 1, 2, \ldots, H$ identifies the stratum,
$c = 1, 2, \ldots, c_h$ identifies the PSU within the stratum and
$i = 1, 2, \ldots, m_{hc}$ identifies the individual within the PSU.
Then

$$\hat{V}(\hat{U}(\hat{\beta})) = \sum_{h=1}^{H} \frac{c_h}{c_h - 1} \sum_{c=1}^{c_h} (t_{hc} - \bar{t}_h)(t_{hc} - \bar{t}_h)',$$

where

$$t_{hc} = \sum_{i=1}^{m_{hc}} w_{hci}\, \tilde{u}_{hci}(T_{hci}, \hat{\beta}) \quad \text{and} \quad \bar{t}_h = \frac{1}{c_h} \sum_{c=1}^{c_h} t_{hc}.$$


3. Inference for multiple-spell hazard rate models


3.1 Model-based inference 


If more than one spell is observed for an individual, it is 
realistic to assume that these spells are not independent. 
Thus, the partial likelihood function (2) is misspecified for 
multiple spells since it does not account for intra-individual 
correlation of the spells observed on the same individual. 
Following Lin and Wei (1989), it is sufficient to modify 
only the covariance matrix of the estimated model 
parameters since the correlated durations affect the variance 
while the model parameters can be estimated consistently 
without accounting for this correlation. Lin (1994) 
demonstrates how the covariance matrix of the estimated 
model parameters might be estimated when there is intra- 
individual correlation of spells, provided that spells from 
different individuals are independent. 


3.2 Design-based inference 


In a longitudinal survey with a multi-stage design, the 
multiple events can be correlated at different levels: the 
spells are clustered within an individual, and individuals are 
clustered within high-stage units. The positive intracluster 
correlation at any level adds extra variation to estimates 
calculated from such data, beyond what is expected under 
independence. The assumption of independence of obser- 
vations when they are cluster-correlated leads to underesti- 
mating the true standard errors, which inflates the values of 
test statistics, and ultimately results in too-frequent rejection 
of null hypotheses. Thus, for multiple spells for individuals, 
where the data are from a longitudinal survey, accounting 
just for correlation within individuals is insufficient. 

Design-based variance estimation for nested cluster- 
correlated data can be greatly simplified when it is 
reasonable to assume that individuals from different primary 
sampling units (PSU’s) are uncorrelated. This is equivalent 
to assuming that the PSU’s are sampled with replacement. 
This assumption also holds approximately when the first 
stage units are obtained by sampling without replacement, 
provided that the sampling rate at the first stage is very 
small. In such a case, an estimate of the between-PSU 
variability captures the variability among units in all 
subsequent stages, regardless of the dependence structure 
among observations within each PSU. For a recent 
summary of robust variance estimation for cluster-correlated 
data see Williams (2000). This implies that Binder’s (1992) 
approach for robust variance estimation of the single-spell 


Statistics Canada, Catalogue No. 12-001-XPB 


model in the case of a survey design having with- 
replacement sampling at the first stage can be directly 
applied to the multiple spell situation since it accounts for 
the impact of cluster-correlation at all levels within each 
PSU.
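
The with-replacement simplification described above can be sketched in a few lines. The following is a minimal illustration, not the authors' software: it assumes that each record (for example, each spell) already carries a weighted residual vector, and the names `wr_psu_variance`, `stratum`, `psu` and `weighted_resid` are hypothetical. Only the variability between PSU totals within each stratum is used, so any dependence among records inside a PSU is absorbed into the PSU total.

```python
import numpy as np

def wr_psu_variance(stratum, psu, weighted_resid):
    """Between-PSU variance estimate of a weighted total, assuming
    with-replacement sampling of PSUs within strata.

    stratum, psu   : 1-D arrays of labels, one per record (a record can be a
                     spell, so several records may share the same person)
    weighted_resid : (n_records, p) array of weight * residual vectors
    """
    stratum = np.asarray(stratum)
    psu = np.asarray(psu)
    z = np.asarray(weighted_resid, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    p = z.shape[1]
    v = np.zeros((p, p))
    for h in np.unique(stratum):
        in_h = stratum == h
        psus_h = np.unique(psu[in_h])
        c_h = len(psus_h)
        if c_h < 2:
            raise ValueError(f"stratum {h!r} needs at least two PSUs")
        # PSU totals: sum over every record that falls in the PSU
        totals = np.vstack([z[in_h & (psu == c)].sum(axis=0) for c in psus_h])
        dev = totals - totals.mean(axis=0)
        v += c_h / (c_h - 1) * dev.T @ dev
    return v
```

Passing a single stratum label and using the person identifier in place of the PSU label gives a variant that allows dependence only among records from the same individual.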


4. Three models for multiple spells 


In order to allow the covariates to have different effects 
for spells of different orders, as well as to allow different 
time dependencies (baseline hazards), we are exploring 
three models for multiple spells. The models differ ac- 
cording to the definition of the risk set and the assumptions 
about the baseline hazard. Two of these models account for 
the order of the spells. 

It should be noted, however, that in our work, spell order 
refers only to spells occurring in the observation period 
from which the data are collected and not to the entire 
history of an individual (unless these two time periods 
coincide). For example, by the first spell we mean a first 
spell in the observation period although it may be a spell of 
some higher absolute order over the person’s lifetime. This 
limitation calls for careful interpretation of any impact that 
spell order may have on covariate effects or on time 
dependency. 


Model 1: In the first model, the risk set is carefully defined 
to take spell order into account in the sense that an 
individual cannot be at risk of completing the second spell 
before he completes the first, etc. This model, known as the 
conditional risk set model, was proposed by Prentice, 
Williams and Peterson (1981) and was reviewed by Lin 
(1994). It was also discussed by Hamerle (1989) and 
Blossfeld and Hamerle (1989) in the context of modelling 
multi-episode processes. Generally, the conditional risk set 
at time $t$ for the completion of a spell of order $j$ consists of 
all individuals that are in their $j$-th spells. This model allows 
spell order to influence both the effect of covariates and the 
shape of the baseline hazard function. 

The hazard function for the $i$-th individual for the spell of 
order $j$ is 

$$h_{ij}(t \mid x_{ij}(t)) = \lambda_{0j}(t)\, e^{x_{ij}(t)'\beta_j},$$


where, for each spell order, a different baseline hazard 
function and a different coefficient vector are allowed. For 
this model and for other models that will be considered in 
this Section, time $t$ is measured from the beginning of the 
$j$-th spell. Although spells within the same individual may 
not be independent, the following partial likelihood is still 
valid for estimation of the $\beta_j$'s: 

$$L(\beta) = \prod_{j=1}^{K} \prod_{i=1}^{N_j} \left[ \frac{e^{x_{ij}(T_{ij})'\beta_j}}{\sum_{l=1}^{N_j} Y_{lj}(T_{ij})\, e^{x_{lj}(T_{ij})'\beta_j}} \right]^{\delta_{ij}}. \qquad (11)$$

Here, $T_{1j}, \ldots, T_{N_j j}$ are $N_j$ durations of possibly right- 
censored $j$-th order spells, $\delta_{ij} = 1$ if $T_{ij}$ is an observed 
duration and $\delta_{ij} = 0$ otherwise, and $K$ is the highest order 
of spells to be included in the Cox model. The denominator 
sum is taken over the $j$-th spells that are at risk of being 
completed at time $T_{ij}$, i.e., $Y_{lj}(t) = 1$ when $t \le T_{lj}$, and is 
equal to 0 otherwise. The corresponding covariate vector 
observed on $[0, T_{ij}]$ is $x_{ij}(t)$. Partial likelihood (11) can be 
maximized separately for each $j$ if there are no additional 
restrictions on the $\beta_j$'s. 

The corresponding score equations that define the finite 
population parameter $B = (B_1', B_2', \ldots, B_K')'$ are: 

$$U_0(B) = \sum_{j=1}^{K} \sum_{i=1}^{N_j} u_{ij}(T_{ij}, B_j) = 0, \qquad (12)$$

with 

$$u_{ij}(T_{ij}, B_j) = \delta_{ij} \left\{ x_{ij}(T_{ij}) - \frac{S^{(1)}(T_{ij}, B_j)}{S^{(0)}(T_{ij}, B_j)} \right\},$$

and with $S^{(0)}(t, B_j)$ and $S^{(1)}(t, B_j)$ having the form of (5) 
and (6) respectively, but with $N_j$ replacing $n$ and $B_j$ 
replacing $\beta$. 

The design-based estimates of the parameters $B_j$ are 
obtained by solving equations $\sum_i w_i(s)\, \hat{u}_{ij}(T_{ij}, B_j) = 0$ 
separately for each $j$, where $\hat{u}_{ij}$ has the form of $u_{ij}$ but 
with $S^{(0)}$ and $S^{(1)}$ replaced by $\hat{S}^{(0)}$ and $\hat{S}^{(1)}$ respectively. 
Note that the sampling weights correspond to individuals 
and not to spells. Similarly, estimation of the covariance 
matrix of each $\hat{B}_j$ will be done separately using the design- 
based robust estimation approach described in Section 2.2. 
Technically, this is a set of analyses separated by spell 
order. 
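
As a rough illustration of the weighted score equations for a single spell order, the sketch below evaluates the weighted Cox score at a trial coefficient vector. It is a simplified sketch under stated assumptions, not the SUDAAN implementation: covariates are taken to be constant within a spell, ties are handled in the Breslow manner, and the function name `weighted_score` and its arguments are hypothetical.

```python
import numpy as np

def weighted_score(beta, T, delta, X, w):
    """Weighted Cox score for the spells of one order.

    T     : durations measured from the start of the spell
    delta : 1 if the spell was completed, 0 if censored
    X     : (n, p) covariates, assumed constant within a spell
    w     : person-level sampling weights, repeated on each spell record
    """
    T, delta, w = (np.asarray(a, dtype=float) for a in (T, delta, w))
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    risk = w * np.exp(X @ beta)              # w_l * exp(x_l' beta)
    score = np.zeros(X.shape[1])
    for i in np.flatnonzero(delta == 1):
        at_risk = T >= T[i]                  # spell l is at risk at T_i when T_i <= T_l
        s0 = risk[at_risk].sum()             # weighted S0 (up to a constant factor)
        s1 = risk[at_risk] @ X[at_risk]      # weighted S1 (same constant factor)
        score += w[i] * (X[i] - s1 / s0)
    return score
```

Setting this score to zero separately for the records of each spell order, with any standard root finder, would give one coefficient vector per order, which mirrors the "set of analyses separated by spell order" described above.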


Model 2: The second model considered is the marginal 
model (Wei, Lin and Weissfeld 1989): 


$$h_{ij}(t \mid x_{ij}(t)) = \lambda_{0j}(t)\, e^{x_{ij}(t)'\beta},$$


where, for each spell order, we allow a different baseline 
hazard function while the covariate effects are kept the same 
over different spell orders. The corresponding partial like- 
lihood function as well as the risk set, under the assumption 
that spells within the same individual are independent, is the 
same as for Model 1, with $\beta$ replacing the $\beta_j$'s. The 
corresponding score equation that defines the finite 
population parameter is 

$$U_0(B) = \sum_{j=1}^{K} \sum_{i=1}^{N_j} u_{ij}(T_{ij}, B) = 0,$$

with 

$$u_{ij}(T_{ij}, B) = \delta_{ij} \left\{ x_{ij}(T_{ij}) - \frac{S^{(1)}(T_{ij}, B)}{S^{(0)}(T_{ij}, B)} \right\},$$

where $S^{(0)}(t, B)$ and $S^{(1)}(t, B)$ are defined by (5) and (6) 
respectively, but with $N_j$ replacing $n$ and $B$ replacing $\beta$. 


The design-based estimate of the parameter $B$ is 
obtained by solving the weighted score equations 

$$\sum_{j=1}^{K} \sum_{i} w_i(s)\, \hat{u}_{ij}(T_{ij}, B) = 0,$$

where $\hat{u}_{ij}$ has the form of $u_{ij}$ but with $S^{(0)}(t, B)$ and 
$S^{(1)}(t, B)$ replaced by $\hat{S}^{(0)}(t, B)$ and $\hat{S}^{(1)}(t, B)$ 
respectively. 

The estimation of the covariance matrix of $\hat{B}$ will be 
done using the design-based robust estimation approach 
explained in Section 3.2. 


Model 3: The last model considered is the following: 

$$h_{ij}(t \mid x_{ij}(t)) = \lambda_0(t)\, e^{x_{ij}(t)'\beta}.$$

In this model we assume that the baseline hazard 
functions and the effects of covariates are common for 
different orders of spells. The risk set at time $t$ is defined 
differently than for Models 1 and 2, and contains all spells 
with $t \le T_{ij}$, effectively assuming that all spells are from 
different individuals. Technically, this model is a single- 
spell model, so that estimation of coefficients and variances 
by a design-based robust method is straightforward. 


5. Example of modelling multiple unemployment spells 


5.1 The data 


The data set that we use for illustration comes from the 
first six-year panel (1993-1998) of the Canadian Survey of 
Labour and Income Dynamics (SLID). In this panel, about 
31,000 individuals from approximately 15,000 households 
were followed for six years through annual interviews. 
Some individuals dropped out of the sample over time for 
any number of reasons while a few others, after missing one 
or more interviews, resumed their participation. A complex 
weighting of the responding SLID individuals each year 
takes into account different types of attrition so that each 
respondent in a particular year is weighted against the 
relevant reference population of 1993. This results in a 
separate longitudinal weight for each wave (i.e., year) of 
data. For this analysis we used the longitudinal weights 
from the last year of the panel, i.e., 1998, which meant that 
data just from those individuals who were respondents in 
the final wave of the panel were included in the analyses. A 
good summary of the sample design issues in SLID is given 
in Lavigne and Michaud (1998). A review of the issues 
related to studies of unemployment spells from SLID is 
given in Roberts and Kovačević (2001). 

The state of interest is “being unemployed”, defined in 
this case as the state between a permanent layoff from a full- 
time job and the commencement of another full-time job. A 
job is “full-time” if it requires at least 30 hours of work per 
week. The event of interest is “the exit from unemployment”. 
Only spells beginning after January 1, 1993 
were included since January 31, 1993 is the starting date for 
observations from the panel. Spells that were not completed 
by the end of the observation period (December 31, 1998) 
were considered censored. Sample counts of the number of 
individuals experiencing eligible spells and the number of 
spells according to their order are given in Table 1. In brief, 
there were 17,880 spells from 8,401 longitudinal individuals. 
About half of the sampled individuals (4,260) who became 
unemployed during this period experienced two or more 
unemployment spells. There were 3,809 spells that re- 
mained uncompleted due to the termination of the panel. 

From a long list of available covariates we chose only 
ten. The variable sex [SEX] of the longitudinal individual is 


the only variable that remains constant over different spells. 
Four variables have values recorded at the end of the year in 
which the spell commenced: education level [EDUCLEV] 
with 4 categories (low, low-medium, medium, high), marital 
status [MARST] with three categories (single, married/ 
common law, other), family income per capita (in Canadian 
dollars) with 4 categories (<10K, 10K-20K, 20K-30K, 
30K+), and age [AGE] (in years). Three variables have the 
values from the lay-off job preceding the spell: type of job 
ending [TYPJBEND] with two categories (fired and 
voluntary), occupation [OCCUPATION] with 6 categories 
(professional, administration, primary sector, manufac- 
turing, construction, and others); and firm size [FIRMSIZE] 
with five categories (<20, 20-99, 100-499, 500-999, 1,000 + 
employees). Two binary variables represent the situation 
during the spell: having a part time job [PARTTJB], and 
attending school [ATSCH]. 

The data set was prepared in the “counting process” style 
where each individual with eligible spells is represented by 
a set of rows, and each row corresponds to a spell. Although 
a row contains time of entry to the spell $t_0$, and time of exit 
$t_1$ or time of censoring $t_c$, the duration time for analysis is 
always considered in the form $(0, t_1 - t_0)$ or $(0, t_c - t_0)$. 
The covariates under consideration are attached to each row. 
Also attached to each row are the 1998 longitudinal weight 
and the identifiers for the stratum and the PSU of the person 
whose spell is being described by that record. 
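
The record layout described above can be sketched as a small data structure. This is an illustrative sketch only: the class `SpellRecord`, the function `build_records` and all field names are hypothetical, and times are assumed to be measured on a common scale such as weeks since the start of the panel.

```python
from dataclasses import dataclass

@dataclass
class SpellRecord:
    person_id: str
    order: int        # order of the spell within the observation window
    duration: float   # time since the start of this spell
    completed: bool   # False when the spell is censored
    weight: float     # person-level longitudinal weight
    stratum: str
    psu: str

def build_records(person_id, spells, weight, stratum, psu, end_of_panel, max_order=4):
    """Turn (entry, exit) pairs into one row per spell.

    spells : list of (entry, exit) pairs; exit is None for a spell still
             open at the end of the panel.
    """
    records = []
    ordered = sorted(spells, key=lambda s: s[0])
    for order, (entry, exit_) in enumerate(ordered, start=1):
        if order > max_order:
            break                      # spells beyond max_order are not analysed
        completed = exit_ is not None
        stop = exit_ if completed else end_of_panel
        # The clock restarts at zero for every spell: duration = stop - entry
        records.append(SpellRecord(person_id, order, stop - entry, completed,
                                   weight, stratum, psu))
    return records
```

Every row of the same person repeats that person's weight, stratum and PSU, which is what allows the variance estimation to group spells by PSU later on.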


Table 1 Counts of individuals in the six-year panel of SLID with unemployment spells beginning between January 1993 
and December 1998, by the total number of spells and by order of spell (C-completed, U-uncompleted) 


Individuals by number of spells Spells by order
                                First          Second         Third          Fourth         5th +
                                C      U       C      U       C      U       C      U       C      U

1 spell           4,141         2,221  1,920   -      -       -      -       -      -       -      -
2 spells          1,915         1,915  -       1,154  761     -      -       -      -       -      -
3 spells          1,044         1,044  -       1,044  -       612    432     -      -       -      -
4 spells          629           629    -       629    -       629    -       348    281     -      -
5+ spells         672           672    -       672    -       672    -       672    -       1,158  415
Total             8,401         6,481  1,920   3,499  761     1,913  432     1,020  281     1,158  415




Survey Methodology, June 2007 


5.2 Analysis 


For the purpose of this illustration we restricted the 
analysis to the first four spells, which means that all sampled 
individuals with eligible spells are included in the analysis 
but the spell records after the fourth spell are not considered 
due to their small number in the sample. 

We estimated coefficients and their variances for the 3 
models by the design-based methods described in Section 4 
through the use of the “SURVIVAL” procedure in 
SUDAAN Version 8. For all three models, the survey 
design was specified to be stratified with with-replacement 
selection of PSU’s (i.e., DESIGN = WR). All three models 
were fit to the same number of spells (16,307). For each 
model, we then calculated the empirical cumulative baseline 
hazard functions using a product-limit approach (see 
Kalbfleisch and Prentice (2002), pages 114-116) as imple- 
mented in the SURVIVAL procedure in SUDAAN. 

In the robust model-based approach for multiple spells 
described in Section 3.1, there is an adjustment in the 
variance estimates to account for the possible dependence 
among spells from the same individual, assuming 
independence of spells from different individuals; however, 
in this approach, no account is taken of the unequal 
probabilities of selection of the sampled individuals in 
either the coefficient estimates or the variance estimates. To 
account for the survey weights, for Models 1 and 2 we also used the 
SURVIVAL procedure in SUDAAN Version 8, to estimate 
the variances of the weighted coefficient estimates where 
we assumed independence of spells between individuals but 
allowed for possible correlation of spells from the same 
individual. We did this by specifying the sampling design to 
be unstratified and having with-replacement selection of 
clusters, and we specified that each individual formed his 
own cluster. The dependence assumptions are the same as 
those used by Lin (1994) but we accounted for the use of 
weights in the estimation of the coefficients and the 
variances. We will call these variance estimates “modified 
robust model-based variance estimates of weighted 
coefficient estimates”. 


5.3 Some descriptive statistics 


The estimated mean duration of a completed spell is 33.3 
weeks while the estimated mean duration of the observed 
portion of a censored (uncompleted) spell is 48.5 weeks. 

Visual examination of estimated Kaplan-Meier survival 
functions (not shown) for spells of each order indicated that, 
as order increased, the value of the survivor function at any 
fixed time $t$ decreased, indicating that first spells are the 
longest among completed spells, and that the higher the 
order of a multiple spell the shorter is its duration. This is 
likely to be a consequence of the limited life of the panel, in 
the sense that an individual with more spells in the given 
six-year time frame is likely to have shorter spells. 


5.4 Model fits using a design-based approach 


As noted earlier, our example is just an illustration of the 
design-based approach to fitting proportional hazards 
models to multiple-event data from a survey with a complex 
design. Thus, little time is spent in this article on discussing 
how to assess the adequacy of these models, such as the 
adequacy of the proportionality assumptions in all of the 
models or whether one type of model fits as well as another. 

Estimated coefficients from fitting the three models to 
the SLID data are given in Table 2. Coefficients found 
significant at the 5% level, through the use of individual $t$ 
tests, are shown in bold. 

Model 1 is conditional on the spell order and involved 
fitting four models separately to the data from the four 
different spell orders. As seen in Table 2, SEX, AGE, and at 
least one category of the Family Income variable were 
significant for spells of all orders, although magnitudes of 
the estimated coefficients differed with the spell order. The 
estimated coefficients for AGE were negative but decreased 
in magnitude as the spell order increased, while there was 
no discernible pattern in the estimated coefficients for the 
other 2 variables. The variables EDUCLEV, PARTTJB and 
ATSCH had significant coefficients for spells of order 1, 2, 
and 3, but not for spells of order 4. This can be at least partly 
attributed to the small sample size for the fourth spells. For 
each of the other three variables in the model (MARST, 
OCCUPATION, and FIRMSIZE), there was just one spell 
order for which a coefficient was significant. 

For Model 2, the model coefficients are restricted to be 
the same for all spell orders. As seen in Table 2, numerically 
many - but not all - of the estimated coefficient values were 
situated between the estimates for the first and the second 
spells obtained for Model 1, which could be due to the fact 
that a high proportion of the sample corresponded to events 
of these orders. All but the OCCUPATION variable had a 
significant coefficient. Standard errors of coefficients were 
smaller for Model 2 than for Model 1. 

Model 3 is a single-spell model with a single set of 
model coefficients and a single baseline hazard function. 
The estimated model coefficients are similar to the estimates 
obtained by Model 2. 
Table 2 Estimated B coefficients for three models 

                             Model 1                                 Model 2   Model 3
                             Order 1   Order 2   Order 3   Order 4

SEX (F)
  M                           0.4417    0.3781    0.3299    0.4435    0.4049    0.4090
EDUCLEV (H)
  L                          -0.4561   -0.5234   -0.3748   -0.1065   -0.4128   -0.4331
  LM                         -0.2330   -0.2700   -0.3310   -0.1653   -0.2436   -0.2474
  M                          -0.0744   -0.1060   -0.1156    0.0668   -0.0684   -0.0671
MARST (M)
  Single                     -0.1142   -0.1290   -0.0622   -0.1375   -0.1357   -0.1330
  Other                       0.0985   -0.0894    0.1124   -0.1072    0.0328    0.0401
TYPJBEND (Fired)
  Voluntary                   0.0704    0.2752    0.4207    0.3413    0.1579    0.1284
OCCUPATION (Othrs)
  Professionals               0.1592   -0.1364   -0.1388    0.0903    0.0490    0.0485
  Admin                      -0.0265   -0.2930   -0.1769    0.0579   -0.0971   -0.0938
  PrimSector                 -0.0211   -0.2175   -0.1187    0.2032   -0.0410   -0.0201
  Manufacture                -0.0003   -0.0994   -0.1295    0.2862   -0.0093   -0.0088
  Construction                0.1290   -0.1862   -0.0879    0.2339    0.0490    0.0813
FIRMSIZE (1000+)
  <20                        -0.0027   -0.0097    0.1005    0.4403    0.0441    0.0408
  20-99                       0.0358    0.0881    0.0815    0.3999    0.0928    0.0951
  100-499                     0.0436   -0.0905    0.0328    0.0257    0.0214    0.0278
  500-999                    -0.0006    0.0153   -0.0623   -0.0067   -0.0005    0.0020
PARTTJB (No)
  Yes                        -0.2903   -0.5414   -0.5109   -0.1407   -0.3693   -0.3743
ATSCH (No)
  Yes                        -1.0832   -1.1516   -1.2956   -1.3541   -1.1205   -1.1266
Family Income Per Capita (10K-)
  10K-20K                     0.1294    0.1802    0.0692    0.1117    0.1345    0.1330
  20K-30K                     0.1644    0.3611    0.1572    0.4900    0.2241    0.2141
  30K+                        0.1712    0.3916    0.3005    0.4241    0.2280    0.2115
AGE                          -0.0491   -0.0311   -0.0269   -0.0207   -0.0424   -0.0435
Spells in risk set             8,386     4,255     2,345     1,300    16,286    16,286
  Censored                     1,913       759       432       281     3,385     3,385
  Completed                    6,473     3,496     1,913     1,019    12,901    12,901


The values significant at a 5% level are bold. 


The estimated cumulative baseline hazard functions for 
Models 1 to 3 are given in Figures 1 to 3 respectively. In all 
cases, for durations up to approximately 50 weeks, the 
functions have a concave shape, implying that there is a 
positive time dependence of the exit rate (i.e., the longer the 
spell, the higher the probability of exit). For durations longer 
than 50 weeks, the shapes become convex, suggesting 
negative time dependence for the longer spells. In Figure 1, 
positions of the estimated cumulative baseline hazard 
functions vary according to spell order, with the curve for 
spells of order 1 being the highest, and the curve for spells 
of order 4 being the lowest. In Figure 2, for Model 2, the 
positions of the different curves do not follow spell order. 
This observed difference between Figures 1 and 2 could 
serve as one visual diagnostic that further study is required 
in order to assess whether Model 1 or Model 2 is a better 
descriptor of the data, since estimated coefficients have an 
impact on the estimated baseline hazards. 

Figure 1 Cumulative Baseline Hazard - Model 1 

Figure 2 Cumulative Baseline Hazard - Model 2 

Figure 3 Cumulative Baseline Hazard - Model 3 


5.5 Comparison to modified robust model-based variance estimates 


As described in Section 5.2, the modified robust model- 
based variance estimates account for possible correlation 
among spells from the same individual, where independence 
among individuals is assumed. When, for Models 1 and 2, 
the standard error estimates obtained by this approach were 
compared to the design-based standard error estimates, only 
very minor differences were observed. This would seem to 
indicate that the design-based estimates are picking up any 
correlation among spells from the same individual and also 
that there does not appear to be additional dependence 
above the level of the individual for our particular example. 


6. Concluding remarks 


We explored the problem of analysis of multiple spells 
by considering two general approaches for dealing with the 
lack of independence among the exit times: a robust model- 
based approach and a design-based approach. The first 
approach estimates the model parameters assuming inde- 
pendence of the spells, and then corrects the naive co- 
variance matrix to account for within-individual depen- 
dencies postulated by the researcher. This approach does not 
account for the possible clustering between individuals (or, 
in fact, for any clustering that might occur at a level above 
the individual) nor for the unequal probabilities of selection 
of individuals (although, in our example, we showed how 
the method could be extended to include the survey 
weights). The second approach defines the model coeffi- 
cients as finite population parameters. These parameters are 
then estimated accounting for possible unequal selection 
probabilities of individuals. A design-based variance esti- 
mation method that accounts for possible correlations be- 
tween individuals in the same PSU automatically accounts 
for the unspecified dependencies of spells at levels below 
the PSU, such as dependencies within individuals. For large 
sample sizes this design-based inference extends directly to 
the super-population from which, hypothetically, the finite 
population was generated. The deficiency of the first 
approach is that it totally ignores the potential for clustering 
between individuals. A possible disadvantage of the second 
approach, as we applied it, is that it relies on the assumption 
of with-replacement sampling of PSU’s of individuals. The 
two approaches coincide in the case of simple random 
sampling of individuals where, in the robust model-based 
approach, dependence among spells from the same 
individual is explicitly postulated and accounted for in the 
variance estimation formula and where, in the design-based 
approach, spells from the same individual are treated as a 
cluster in the design-based variance estimation. 

We applied the design-based approach to three propor- 
tional-hazards-type models. One model allowed for differ- 
ential unspecified baseline hazards and different coefficients 
for each spell order. The second model still allowed for 
differential unspecified baseline hazards for different spell 
orders but required the coefficients to be the same over 
orders. The third model was a simple single-spell model. 
We found that how information on the spell order was used 
affected the results of our model-fitting. A visual comparison 
of the coefficient estimates and the estimates of the 
cumulative baseline hazards for Models 1 and 2 indicated 
different results. A formal test for whether the coefficients 
actually differ by spell order (as allowed in Model 1), given 
baseline hazards that can differ by spell order, would be 
useful, as suggested by one of the referees. It is actually 
straightforward to produce such a test, which can be done as 
follows. Let $\gamma = (\beta_1', \beta_2', \ldots, \beta_K')'$ be the vector of all $K$ 
coefficient vectors of Model 1, where each has length $p$, 
and let $z_{ij}(t) = (0', 0', \ldots, x_{ij}(t)', 0', \ldots, 0')'$ be the vector of 
length $pK$ for the $j$-th spell of the $i$-th individual where the 
$j$-th component of this vector contains the vector of 
covariates $x_{ij}(t)$. Then, Model 1 can be expressed as 

$$h_{ij}(t \mid z_{ij}(t)) = \lambda_{0j}(t)\, e^{z_{ij}(t)'\gamma},$$

which has the general form of baseline hazards varying with 
spell order but a fixed coefficient vector. A test for 
constancy of the coefficients pertaining to each spell order, 
i.e., $H_0: \beta_1 = \beta_2 = \ldots = \beta_K$, is equivalent to testing 
$H_0: C\gamma = 0$, where $C$ is the $(K-1)p \times Kp$ matrix 
$C = [1_{K-1}, -I_{K-1}] \otimes I_p$, with $1_{K-1}$ a column of ones and $I_m$ 
the identity matrix of order $m$. Given an estimate $\hat{\gamma}$ of $\gamma$ and an 
estimate $\hat{V}(\hat{\gamma})$ of the covariance matrix of $\hat{\gamma}$, obtained as 
described in Section 4 for Model 2, a Wald statistic may be 
calculated in order to test the hypothesis. If the hypothesis is 
not rejected, it may be concluded that a model with constant 
coefficients over spell order (but baseline hazard varying 
with spell order) appears to fit the data as well as a model 
where both baseline hazard and coefficients vary with spell 
order. Other measures for model adequacy should also be 
straightforward to develop under the design-based frame- 
work. 
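
The Wald test outlined above can be sketched as follows. This is a minimal illustration rather than a prescribed procedure: it assumes the coefficient vector is stacked by spell order, that a full covariance estimate for the stacked vector is available, and the function name `wald_equal_coefficients` is hypothetical.

```python
import numpy as np

def wald_equal_coefficients(gamma_hat, v_hat, K, p):
    """Wald statistic for H0: beta_1 = ... = beta_K.

    gamma_hat : length K*p vector stacked by spell order (beta_1, ..., beta_K)
    v_hat     : (K*p, K*p) estimated covariance matrix of gamma_hat
    """
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    # Each block row of C compares beta_1 with beta_k, k = 2, ..., K
    contrast = np.hstack([np.ones((K - 1, 1)), -np.eye(K - 1)])
    C = np.kron(contrast, np.eye(p))              # shape ((K-1)p, Kp)
    d = C @ gamma_hat
    stat = float(d @ np.linalg.solve(C @ v_hat @ C.T, d))
    return stat, (K - 1) * p                      # statistic, degrees of freedom
```

The statistic would usually be referred to a chi-square distribution with $(K-1)p$ degrees of freedom; with a small number of PSUs, an adjusted reference distribution may be preferable.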

We also visually compared, for our example, coefficient 
standard error estimates obtained under the design-based 
approach (accounting for clustering at the PSU level and 
lower) and obtained under a modification of the robust 
model-based approach (accounting for clustering at the 
individual level and lower) for Models 1 and 2. We found 
only minor differences, which indicated no clustering 
effects above the individual level for these particular data. 
We also calculated standard error estimates assuming 
independence even between spells from the same person 
and again found only minor differences with those obtained 
from the design-based approach. It thus seems that, for this 
particular example, there is little inter-spell dependence. 
However, in general, we feel that a design-based approach 
guards against missing any unpostulated dependencies at the 
PSU level and lower in the variance estimates. 
Bayesian weight trimming for generalized linear regression models 


Michael R. Elliott ¹ 


Abstract 


In sample surveys where units have unequal probabilities of inclusion in the sample, associations between the probability of 
inclusion and the statistic of interest can induce bias. Weights equal to the inverse of the probability of inclusion are often 
used to counteract this bias. Highly disproportional sample designs have large weights, which can introduce undesirable 
variability in statistics such as the population mean estimator or population regression estimator. Weight trimming reduces 
large weights to a fixed cutpoint value and adjusts weights below this value to maintain the untrimmed weight sum, 
reducing variability at the cost of introducing some bias. Most standard approaches are ad-hoc in that they do not use the 
data to optimize bias-variance tradeoffs. Approaches described in the literature that are data-driven are a little more efficient 
than fully-weighted estimators. This paper develops Bayesian methods for weight trimming of linear and generalized linear 
regression estimators in unequal probability-of-inclusion designs. An application to estimate injury risk of children rear- 
seated in compact extended-cab pickup trucks using the Partners for Child Passenger Safety surveillance survey is 
considered. 

Key Words: Sample survey; Sampling weights; Weight Winsorization; Bayesian population inference; Weight 
smoothing; Generalized linear mixed models. 


1. Introduction 


Analysis of data from samples with differential 
probabilities of inclusion typically uses case weights equal to 
the inverse of the probability of inclusion to reduce or 
remove bias in the estimators of population quantities of 
interest. Replacing implicit means and totals in statistics 
with their case-weighted equivalents yields unbiased linear 
estimators and asymptotically unbiased non-linear esti- 
mators of population values (Binder 1983). Case weights 
may also incorporate non-response adjustments, which 
typically are equal to the inverse of the estimated probability 
of response (Gelman and Carlin 2002, Oh and Scheuren 
1983), or calibration adjustments, which constrain case 
weights to equal known population totals, either jointly, as 
in poststratification or generalized regression estimation, or 
marginally, as in generalized raking estimation (Deville and 
Särndal 1992, Isaki and Fuller 1982). 

There is little debate that sampling weights should be utilized 
when considering descriptive statistics such as means and 
totals obtained from unequal probability-of-selection 
designs. However, when estimating “analytical” quantities 
(Cochran 1977, page 4) that focus on associations between, 
e.g., risk factors and health outcomes estimated via linear 
and generalized linear models, the decision to use sampling 
weights is less definitive (cf. Korn and Graubard 1999, 
pages 180-182). In a regression setting, discrepancies 
between weighted and unweighted regression slope 
estimators can occur either because the data model is 
misspecified or because there is an association between the 
residual errors and the probability of inclusion (sampling is 


informative). When the data model is misspecified, one 
option is to improve the model specification. However, it 
may be difficult to determine the exact functional form; or it 
may be that the degree of misspecification is very modest 
but is magnified by the sample design; or it may be that an 
approximation to the true model is desired to simplify 
explanation (linearly approximating a quadratic trend). In 
the case of informative or non-ignorable sampling, design 
weights may be required to obtain consistent estimators of 
regression parameters (Korn and Graubard 1995). More 
formally, fully-weighted estimators of regression parameters 
are “pseudo-maximum likelihood” estimators (PMLEs) 
(Binder 1983, Pfeffermann 1993) in that they are “design 
consistent” for MLEs that would solve the score equations 
for the regression parameters under the assumed 
superpopulation regression model if we had observed data 
for the entire population. Design consistency implies that 
the difference between the population target quantity and 
the estimate derived from the sample tends to zero as the 
sample size and population size jointly increase, or that 
these differences will on average tend to 0 from repeated 
sampling of the population, where samples are selected in 
an identical fashion from $t \to \infty$ replicates of the 
population: see Särndal (1980) or Isaki and Fuller (1982). If 
observations are clustered, more care must be taken to 
develop design consistent estimators of PMLEs, although 
nested multi-stage designs allow for the census log- 
likelihood estimates to be approximated using weighted 
score equations if care is taken to account for the fact that 
the within-cluster sample sizes typically are small and 
remain so even if the number of clusters increases 
(Pfeffermann, Skinner, Holmes, Goldstein and Rasbash 
1998, Korn and Graubard 2003). 

Although PMLEs are popular because of design 
consistency, this property is purchased at the cost of 
increased variance. This increase can overwhelm the 
reduction in bias, so that the MSE actually increases under a 
weighted analysis. This is particularly likely if a) the sample 
size is small, b) the differences in the inclusion probabilities 
are large, or c) the model is approximately correctly 
specified and the sampling is approximately noninfor- 
mative. Perhaps the most common approach to dealing with 
this problem is weight trimming (Potter 1990, Kish 1992, 
Alexander, Dahl and Weidman 1997), in which weights 
larger than some value $w_0$ are fixed at $w_0$. Typically $w_0$ is
chosen in an ad hoc manner, say 3 or 6 times the mean
weight, without regard to whether the chosen cutpoint is
optimal with respect to MSE. Thus bias is introduced to 
reduce variance, with the goal of an overall reduction in 
MSE. 

Other design-based methods have been considered in the 
literature. Potter (1990) discusses systematic methods for 
choosing $w_0$, including weight distribution and MSE
trimming procedures. The weight distribution technique
assumes that the weights follow an inverted and scaled beta
distribution; the parameters of the inverse-beta distribution
are estimated by method-of-moments estimators, and
weights from the upper tail of the distribution, say where
$1 - F(w_i) < 0.01$, are trimmed to $w_0$ such that
$1 - F(w_0) = 0.01$. The MSE trimming procedure determines
the empirical MSE at trimming level $w_t$, where the trimmed
weight is $w_{it} = w_t I(w_i \ge w_t) + w_i I(w_i < w_t)$, $i = 1, \ldots, n$,
under the assumption that the fully weighted estimate is
unbiased for the true mean. In practice, one considers a
variety of trimming levels $t = 1, \ldots, T$, where $t = 1$
corresponds to the unweighted data ($w_1 = \min_i(w_i)$) and
$t = T$ to the fully-weighted data ($w_T = \max_i(w_i)$), and $\hat{\theta}_t$
is the value of the statistic using the trimmed weights at
level $t$. The trimming level chosen is then given by
$w_0 = w_{t^*}$, where $t^* = \arg\min_t(\mathrm{MSE}_t)$ for
$\mathrm{MSE}_t = (\hat{\theta}_t - \hat{\theta}_T)^2 + \hat{V}(\hat{\theta}_t)$.
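
The MSE trimming search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the statistic is a weighted mean, the variance estimate is a simple with-replacement linearization (the text does not specify which variance estimator is used), and the function names `weighted_mean_and_var` and `mse_trim` are hypothetical.

```python
import numpy as np

def weighted_mean_and_var(y, w):
    """Weighted mean and a with-replacement linearization variance (an assumption)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = y.size
    est = np.sum(w * y) / np.sum(w)
    z = w * (y - est) / np.sum(w)            # linearized contributions, sum to zero
    var = n / (n - 1) * np.sum(z ** 2)
    return est, var

def mse_trim(y, w, levels):
    """Return (chosen cutpoint, empirical MSE at each candidate cutpoint)."""
    w = np.asarray(w, dtype=float)
    theta_full, _ = weighted_mean_and_var(y, w)   # treated as unbiased for the target
    mse = []
    for w_t in levels:
        # w_it = w_t * I(w_i >= w_t) + w_i * I(w_i < w_t)
        w_trim = np.minimum(w, w_t)
        theta_t, var_t = weighted_mean_and_var(y, w_trim)
        mse.append((theta_t - theta_full) ** 2 + var_t)
    mse = np.array(mse)
    return levels[int(np.argmin(mse))], mse

# Illustrative data only: weights unrelated to y, so heavy trimming should be favoured.
rng = np.random.default_rng(0)
w = rng.uniform(1, 20, size=200)
y = rng.normal(5, 2, size=200)
levels = np.linspace(w.min(), w.max(), 10)   # t = 1 is unweighted, t = T fully weighted
cut, mse = mse_trim(y, w, levels)
```

Because the squared-bias term is measured against the fully-weighted estimate, the empirical MSE at the last level reduces to the variance estimate alone.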

In the calibration literature, techniques have been 
developed that allow generalized poststratification or raking 
adjustments to be bounded to prevent the construction of 
extreme weights (Deville and Särndal 1992, Folsom and
Singh 2000). Beaumont and Alavi (2004) extend this idea to 
develop estimators that focus on trimming large weights of 
highly influential or outlying observations. While these 
bounds trim extreme weights to a fixed cutpoint value, the 
choice of this cutpoint remains arbitrary. 

An alternative approach to the direct weight trimming 
procedures has been developed in the Bayesian finite 
population inference literature (Elliott and Little 2000, Holt 
and Smith 1979, Ghosh and Meeden 1986, Little 1991,
1993, Lazzeroni and Little 1998, Rizzo 1992). These 
approaches account for unequal probabilities of inclusion by 
considering the case weights as stratifying variables within 
strata defined by the probability of inclusion. These 
“inclusion strata” may correspond to formal strata from a 
disproportional stratified sample design, or may be “pseudo- 
strata” based on collapsed or pooled weights derived from 
selection, poststratification, and/or non-response adjust- 
ments. Standard weighted estimates are then obtained when 
the weight stratum means of survey outcomes are treated as 
fixed effects, and trimming of the weights is achieved by 
treating the underlying weight stratum means as random 
effects. These methods allow for the possibility of
“partially-weighted” data, using the data itself to
appropriately modulate the bias-variance tradeoff, and also allow
estimation and inference from data collected under unequal
probability-of-inclusion sample designs to be based on 
models common to other fields of statistical estimation and 
inference. 

This paper extends these random-effects models, which 
we term “weight smoothing” models, to include estimation 
of population parameters in linear and generalized linear 
models. Section 2 briefly reviews Bayesian finite population 
inference, formalizes the concept of ignorable and non- 
ignorable sampling mechanisms, and develops the weight 
smoothing models for linear and generalized linear 
regression models in a fully Bayesian setting. Section 3 
provides simulation results to consider the repeated 
sampling properties of the weight smoothing estimators of 
linear and logistic regression parameters in a dispro- 
portional-stratified sample design and compares them with 
standard design-based estimators. Section 4 illustrates the 
use of the weight smoothing estimators in an analysis of risk 
of injury to children in passenger vehicle crashes. Section 5 
summarizes the results of the simulations and considers 
extensions to more complex sample designs. 


2. Bayesian finite population inference 


Let the population data for a population with
$i = 1, \ldots, N$ units be given by $Y = (y_1, \ldots, y_N)$, with
associated covariate vectors $X = (x_1, \ldots, x_N)$ and
sampling indicator variable $I = (I_1, \ldots, I_N)$, where $I_i = 1$ if
the $i$th element is sampled and 0 otherwise. As in design-based
population inference, Bayesian population inference
focuses on population quantities of interest $Q(Y)$, such as
population means $Q(Y) = \bar{Y}$ or population least-squares
regression parameters $Q(Y, X) = \arg\min_{B_0, B_1} \sum_{i=1}^{N} (y_i - B_0 - B_1 x_i)^2$.
In contrast to design-based inference, but consistent
with most other areas of statistics, one posits a model for
the population data $Y$ as a function of parameters $\theta$:
$Y \sim f(Y \mid \theta)$. Inference about $Q(Y)$ is made based on the
posterior predictive distribution $p(Y_{nob} \mid Y_{obs}, I)$, where
$Y_{nob}$ consists of the elements $y_i$ of $Y$ for which $I_i = 0$:

$$p(Y_{nob} \mid Y_{obs}, I) = \frac{\iint p(Y_{nob} \mid Y_{obs}, \theta, \phi)\, p(I \mid Y, \theta, \phi)\, p(Y_{obs} \mid \theta)\, p(\theta, \phi)\, d\theta\, d\phi}{\iiint p(Y_{nob} \mid Y_{obs}, \theta, \phi)\, p(I \mid Y, \theta, \phi)\, p(Y_{obs} \mid \theta)\, p(\theta, \phi)\, d\theta\, d\phi\, dY_{nob}} \qquad (1)$$

where $p(I \mid Y, \theta, \phi)$ models the inclusion indicator.

If we assume that $\theta$ and $\phi$ are a priori independent and
if the distribution of the sampling indicator $I$ is independent of
$Y$, the sampling design is said to be “unconfounded” or
“noninformative”; if the distribution of $I$ depends only on
$Y_{obs}$, then the sampling mechanism is said to be “ignorable”
(Rubin 1987), equivalent to the standard missing data
terminology (the unobserved elements of the population can
be thought of as missing by design). Under ignorable
sampling designs, $p(\theta, \phi) = p(\theta)p(\phi)$ and
$p(I \mid Y, \theta, \phi) = p(I \mid Y_{obs}, \phi)$, and thus (1) reduces to

$$\frac{\int p(Y_{nob} \mid Y_{obs}, \theta)\, p(Y_{obs} \mid \theta)\, p(\theta)\, d\theta}{\iint p(Y_{nob} \mid Y_{obs}, \theta)\, p(Y_{obs} \mid \theta)\, p(\theta)\, d\theta\, dY_{nob}} = p(Y_{nob} \mid Y_{obs}), \qquad (2)$$

allowing inference about $Q(Y)$ to be made without
explicitly modeling the sampling inclusion parameter $I$
(Ericson 1969, Holt and Smith 1979, Little 1993, Rubin
1987, Skinner, Holt and Smith 1989). Noninformative
sample designs are a special case of ignorable sample
designs, equivalent to missing completely at random
mechanisms being a special case of missing at random
mechanisms.

In the regression setting, where inference is desired about
parameters that govern the distribution of $Y$ conditional on
fixed and known covariates $X$, (1) becomes

$$p(Y_{nob} \mid Y_{obs}, X, I) = \frac{\iint p(Y_{nob} \mid Y_{obs}, X, \theta, \phi)\, p(I \mid Y, X, \theta, \phi)\, p(Y_{obs} \mid X, \theta)\, p(\theta, \phi)\, d\theta\, d\phi}{\iiint p(Y_{nob} \mid Y_{obs}, X, \theta, \phi)\, p(I \mid Y, X, \theta, \phi)\, p(Y_{obs} \mid X, \theta)\, p(\theta, \phi)\, d\theta\, d\phi\, dY_{nob}},$$

which reduces to

$$p(Y_{nob} \mid Y_{obs}, X) = \frac{\int p(Y_{nob} \mid Y_{obs}, X, \theta)\, p(Y_{obs} \mid X, \theta)\, p(\theta)\, d\theta}{\iint p(Y_{nob} \mid Y_{obs}, X, \theta)\, p(Y_{obs} \mid X, \theta)\, p(\theta)\, d\theta\, dY_{nob}}$$

if and only if $I$ depends only on $(Y_{obs}, X)$, of which
dependence on $X$ only is a special case. Thus if inference
is desired about a regression parameter $Q(Y, X)$, then a
noninformative or more generally ignorable sample design
can allow inclusion probabilities to be a function of the
fixed covariates.


2.1 Accommodating unequal probabilities of 
inclusion 


Maintaining the ignorability assumption for the sampling 
mechanism often requires accounting for the sample design 
in both the likelihood and prior model structure. In the case 
of the unequal probability-of-inclusion sample designs, this 
can be accomplished by developing an index h=1,...,H of 
the probability of inclusion (Little 1983, 1991); this could 
either be a one-to-one mapping of the case weight order 
statistics to their rankings, or a preliminary “pooling” of the 
case weights using, e.g., the 100/H percentiles of the case 
weights. The data are then modeled by 


$$y_{hi} \mid \theta_h \sim f(y_{hi}; \theta_h), \quad i = 1, \ldots, N_h,$$


for all elements in the $h$th inclusion stratum, where $\theta_h$
allows for an interaction between the model parameter(s) $\theta$
and the inclusion stratum $h$. Putting a noninformative prior
distribution on $\theta_h$ then reproduces a fully-weighted analysis
with respect to the expectation of the posterior predictive
distribution of $Q(Y)$.
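
As an illustration of the preliminary “pooling” step, the following sketch assigns each case weight to one of $H$ inclusion strata using the $100/H$ percentiles of the weights. The function name `inclusion_strata` and the simulated weights are hypothetical; stratum 1 holds the smallest weights (highest inclusion probabilities), which is an assumed labelling convention.

```python
import numpy as np

def inclusion_strata(weights, H):
    """Pool case weights into H strata using the 100/H percentiles as cutpoints."""
    weights = np.asarray(weights, dtype=float)
    # Interior cutpoints at the 100/H, 200/H, ... percentiles of the weights.
    cuts = np.quantile(weights, np.arange(1, H) / H)
    # Labels run 1..H; the smallest weights land in stratum 1.
    return np.searchsorted(cuts, weights, side="left") + 1

rng = np.random.default_rng(1)
w = rng.lognormal(mean=1.0, sigma=0.8, size=1000)   # illustrative weights only
h = inclusion_strata(w, H=10)
```

With continuous weights this yields strata of roughly equal size, and the stratum index is non-decreasing in the weight.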

To make this concrete, assume we are interested in
estimating a population mean $Q(Y) = \bar{Y} = N^{-1} \sum_{i=1}^{N} y_i$ from
an unequal probability-of-inclusion sample with a simple random
sample within inclusion strata. Rewriting as $Q(Y) = \sum_h P_h \bar{Y}_h$,
where $\bar{Y}_h = N_h^{-1} \sum_{i=1}^{N_h} y_{hi}$ is the population inclusion
stratum mean and $P_h = N_h/N$, we have

$$E(\bar{Y} \mid Y_{obs}) = \sum_h P_h E(\bar{Y}_h \mid Y_{obs}) = N^{-1} \sum_h \{ n_h \bar{y}_{h,obs} + (N_h - n_h) E(\bar{Y}_{h,nob} \mid Y_{obs}) \},$$

where $\bar{Y}_h$ is decomposed into the observed inclusion
stratum mean $\bar{y}_{h,obs} = n_h^{-1} \sum_{i=1}^{N_h} I_{hi} y_{hi}$ and the unobserved
inclusion stratum mean $\bar{Y}_{h,nob} = (N_h - n_h)^{-1} \sum_{i=1}^{N_h} (1 - I_{hi}) y_{hi}$.
If we assume

$$y_{hi} \mid \mu_h, \sigma_h^2 \overset{ind}{\sim} N(\mu_h, \sigma_h^2)$$

$$p(\mu_h, \sigma_h^2) \propto 1$$

then

$$E(\bar{Y}_{h,nob} \mid Y_{obs}) = E(E(\bar{Y}_{h,nob} \mid Y_{obs}, \mu_h, \sigma_h^2) \mid Y_{obs}) = E(\mu_h \mid Y_{obs}) = \bar{y}_{h,obs}$$

and the posterior predictive mean of the population mean is
given by the weighted sample mean:

$$E(\bar{Y} \mid Y_{obs}) = \sum_h P_h E(\bar{Y}_h \mid Y_{obs}) = N^{-1} \sum_h N_h \bar{y}_{h,obs} = N^{-1} \sum_h \sum_{i=1}^{n_h} w_{hi} y_{hi},$$

where $w_{hi} \equiv w_h = N_h/n_h$ for all the observed elements in
inclusion stratum $h$. Further, the weighted mean will be the
posterior predictive expectation of the population mean for
any assumed distribution of $Y$ as long as $E(y_{hi} \mid \mu_h) = \mu_h$.
In contrast, a simple exchangeable model for the data

$$y_i \mid \mu, \sigma^2 \overset{iid}{\sim} N(\mu, \sigma^2)$$

$$p(\mu, \sigma^2) \propto 1$$

yields $E(\bar{Y} \mid Y_{obs}) = n^{-1} \sum_{i=1}^{n} y_i$, the unweighted estimator
of the mean, which may be badly biased if exchangeability
fails to hold, as would be the case if there is an association
between the probability of inclusion and $Y$.
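
The identity between the posterior predictive mean under the flat-prior stratum model and the design-weighted mean can be checked numerically. The sketch below uses made-up stratum sizes, sample sizes, and means purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N_h = np.array([5000, 3000, 2000])    # hypothetical population stratum sizes
n_h = np.array([50, 100, 200])        # hypothetical stratum sample sizes
mu_h = np.array([1.0, 3.0, 6.0])      # stratum means differ, so weighting matters

y = [rng.normal(m, 1.0, size=n) for m, n in zip(mu_h, n_h)]
ybar_h = np.array([y_h.mean() for y_h in y])
N = N_h.sum()

# Posterior predictive mean: observed totals plus unobserved units predicted at ybar_h.
post_mean = np.sum(n_h * ybar_h + (N_h - n_h) * ybar_h) / N

# Design-weighted mean with w_h = N_h / n_h.
w = np.concatenate([np.full(n, N_s / n) for N_s, n in zip(N_h, n_h)])
y_all = np.concatenate(y)
weighted_mean = np.sum(w * y_all) / N

# Unweighted mean implied by the exchangeable model.
unweighted_mean = y_all.mean()
```

Here the high-mean stratum is heavily oversampled, so the unweighted mean sits well above the weighted mean, illustrating the bias that arises when inclusion probability and $Y$ are associated.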


2.2 Weight smoothing models 


In its general form, our proposed “weight smoothing 
method” stratifies the data by the probability of inclusion 
and then uses a hierarchical model to effect trimming via 
shrinkage. A general description of such a model is given by 


$$y_{hi} \mid \theta_h \sim f(y_{hi}; \theta_h) \qquad (3)$$

$$\theta_h \mid M_h, \mu, R \sim N(\hat{y}_h, R), \quad \hat{y}_h = g(M_h, \mu)$$

$$\mu, R \mid M \sim \Pi,$$


where $h = 1, \ldots, H$ indexes the probability of inclusion
from the highest to the lowest probabilities, $g(M_h, \mu)$ is a
function linking information $M_h$ from the inclusion
probability stratum and a smoothing parameter $\mu$ to the
data distribution parameter $\theta_h$ indexed by the inclusion
stratum, and $\Pi$ is a flat or weakly informative
hyperparameter distribution (Little 2004).

The particulars of the likelihood and prior specifications 
will depend on the population parameter of interest, the 
sample design, distributional assumptions about y, and 
efficiency-robustness tradeoffs. Positing an exchangeable 
model on the inclusion stratum means from the previous 
example yields (Lazzeroni and Little 1998, Elliott and Little 
2000) 


$$y_{hi} \mid \mu_h \overset{ind}{\sim} N(\mu_h, \sigma^2)$$

$$\mu_h \overset{iid}{\sim} N(\mu, \tau^2).$$

Assuming for the moment $\sigma^2$ and $\tau^2$ known, we have

$$E(\bar{Y} \mid Y_{obs}) = N^{-1} \sum_h \{ n_h \bar{y}_h + (N_h - n_h) E(\mu_h \mid Y_{obs}) \},$$

where $E(\mu_h \mid Y_{obs}) = w_h \bar{y}_h + (1 - w_h)\tilde{y}$ for
$w_h = \tau^2 n_h/(\tau^2 n_h + \sigma^2)$ and
$\tilde{y} = (\sum_h n_h/(n_h \tau^2 + \sigma^2))^{-1} \sum_h n_h/(n_h \tau^2 + \sigma^2)\, \bar{y}_h$.
As $\tau^2 \to \infty$, $w_h \to 1$ so that $E(\bar{Y} \mid Y_{obs}) = \sum_h P_h \bar{y}_h$;
thus a flat prior recovers the fully-weighted estimator,
as we showed previously. On the other hand, as
$\tau^2 \to 0$, $w_h \to 0$ so that $E(\mu_h \mid Y_{obs}) \to \tilde{y} = \bar{y}$, the
unweighted mean; thus the excluded units of the sample are
estimated at the pooled mean since the model assumes that
all $y_{hi}$ are drawn from a common mean. Hence this weight
smoothing model allows compromise between the
design-consistent estimator, which may be highly inefficient, and
the unweighted estimator that is fully efficient under the
strong assumption that the inclusion probability and mean of
$Y$ are independent. By assuming a weak hyperprior
distribution on $\tau^2$, the degree of compromise between the
weighted and unweighted mean will be “data-driven,” albeit
under the modeling assumptions.
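
A small numerical sketch of this compromise, with $\sigma^2$ and $\tau^2$ treated as known, is given below. The function name `smoothed_mean` and the stratum summaries are hypothetical; a full analysis would place a hyperprior on $\tau^2$ rather than fixing it.

```python
import numpy as np

def smoothed_mean(ybar_h, n_h, N_h, sigma2, tau2):
    """Posterior predictive mean of the population mean for known sigma2 and tau2."""
    ybar_h, n_h, N_h = (np.asarray(a, dtype=float) for a in (ybar_h, n_h, N_h))
    w_h = tau2 * n_h / (tau2 * n_h + sigma2)           # shrinkage weights
    prec = n_h / (n_h * tau2 + sigma2)
    y_tilde = np.sum(prec * ybar_h) / np.sum(prec)     # precision-weighted pooled mean
    mu_hat = w_h * ybar_h + (1 - w_h) * y_tilde        # E(mu_h | Y_obs)
    return np.sum(n_h * ybar_h + (N_h - n_h) * mu_hat) / np.sum(N_h)

# Illustrative stratum summaries only.
ybar_h = np.array([1.1, 2.9, 6.2])
n_h = np.array([50, 100, 200])
N_h = np.array([5000, 3000, 2000])

fully_weighted = np.sum(N_h * ybar_h) / np.sum(N_h)
unweighted = np.sum(n_h * ybar_h) / np.sum(n_h)

flat_prior = smoothed_mean(ybar_h, n_h, N_h, sigma2=1.0, tau2=1e12)   # tau2 -> infinity
no_spread = smoothed_mean(ybar_h, n_h, N_h, sigma2=1.0, tau2=0.0)     # tau2 -> 0
middle = smoothed_mean(ybar_h, n_h, N_h, sigma2=1.0, tau2=1.0)
```

The two limiting calls reproduce the fully-weighted and unweighted means, and the intermediate value of $\tau^2$ gives an estimate between them for these inputs.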


2.3 Weight smoothing for linear and generalized 
linear regression models 


Generalized linear regression models (McCullagh and
Nelder 1989) postulate a likelihood for $y_i$ of the form

$$f(y_i; \theta_i, \phi) = \exp\left[ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right],$$

where $a_i(\phi)$ involves a known constant and a (nuisance)
scale parameter $\phi$, and the mean of $y_i$ is related to a linear
combination of fixed covariates $x_i$ through a link function
$g(\cdot)$: $E(y_i \mid \theta_i) = \mu_i$, where $g(\mu_i) = g(b'(\theta_i)) = \eta_i = x_i^T \beta$.
We also have $\mathrm{Var}(y_i \mid \theta_i) = a_i(\phi)V(\mu_i)$, where
$V(\mu_i) = b''(\theta_i)$. The link is canonical if $\theta_i = \eta_i$, in which
case $g'(\mu_i) = V^{-1}(\mu_i)$. Well-known examples are the
normal distribution, where $a_i(\phi) = \sigma^2$ and the canonical link
is $g(\mu_i) = \mu_i$; the binomial distribution, where $a_i(\phi) = n_i^{-1}$
and the canonical link is $g(\mu_i) = \log(\mu_i/(1 - \mu_i))$; and the
Poisson distribution, where $a_i(\phi) = 1$ and the canonical
link is $g(\mu_i) = \log(\mu_i)$.

Indexing the inclusion stratum by $h$, we have
$g(E[y_{hi} \mid \beta_h]) = x_{hi}^T \beta_h$. We assume a hierarchical model
of the form

$$f(y_{hi}; \theta_{hi}, \phi) = \exp\left[ \frac{y_{hi} \theta_{hi} - b(\theta_{hi})}{a_{hi}(\phi)} + c(y_{hi}, \phi) \right] \qquad (4)$$

$$(\beta_1^T, \ldots, \beta_H^T)^T \mid \beta^*, G \sim N_{Hp}(\beta^*, G), \qquad (5)$$

where $\beta^*$ is an unknown vector of mean values for the
regression coefficients and $G$ is an unknown covariance
matrix.
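
The canonical-link property $g'(\mu_i) = V^{-1}(\mu_i)$ quoted above can be verified numerically for the three examples. The dictionary layout and the helper `link_derivative` below are illustrative choices, not part of the original text.

```python
import numpy as np

# Canonical link g and variance function V for the three examples in the text.
families = {
    "normal":   {"link": lambda mu: mu,                    "var": lambda mu: np.ones_like(mu)},
    "binomial": {"link": lambda mu: np.log(mu / (1 - mu)), "var": lambda mu: mu * (1 - mu)},
    "poisson":  {"link": np.log,                           "var": lambda mu: mu},
}

def link_derivative(g, mu, eps=1e-6):
    """Central finite-difference approximation to g'(mu)."""
    return (g(mu + eps) - g(mu - eps)) / (2 * eps)

mu = np.linspace(0.1, 0.9, 9)   # means valid for all three families
max_err = {
    name: float(np.max(np.abs(link_derivative(f["link"], mu) - 1.0 / f["var"](mu))))
    for name, f in families.items()
}
```

Each entry of `max_err` is at the level of finite-difference error, consistent with $g'(\mu) = 1/V(\mu)$ for the identity, logit, and log links.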

We consider the target population quantity of interest
$B = (B_1, \ldots, B_p)^T$ to be the slope that solves the population
score equation $U_N(B) = 0$ where

$$U_N(B) = \sum_{i=1}^{N} \frac{\partial}{\partial B} \log f(y_i; B) = \sum_{i=1}^{N} \frac{(y_i - \mu_i(B))\, x_i}{V(\mu_i(B))\, g'(\mu_i(B))}, \qquad (6)$$

with $\mu_i(B) = g^{-1}(x_i^T B)$.
Note that the quantity $B$ such that $U_N(B) = 0$ is always
a meaningful population quantity of interest even if the
model is misspecified (i.e., $\eta_i$ is not exactly linear with
respect to the covariates), since it is the linear approximation
of $x_i$ to $\eta_i = g(\mu_i)$. Under the model given by (4) and (5),
a first-order approximation (assuming a negligible sampling
fraction) to $E(B \mid y, X)$ is given by $\hat{B}$ where

$$\sum_h W_h \sum_{i=1}^{n_h} \frac{(\hat{y}_{hi} - \mu_{hi}(\hat{B}))\, x_{hi}}{V(\mu_{hi}(\hat{B}))\, g'(\mu_{hi}(\hat{B}))} = 0, \qquad (7)$$

where $W_h = N_h/n_h$, $\hat{y}_{hi} = g^{-1}(x_{hi}^T \hat{\beta}_h)$, and $\hat{\beta}_h =
E(\beta_h \mid y, X)$, as determined by the form of (5). (If $N_h$ is
unknown, it can be replaced with $\hat{N}_h = \sum_{i=1}^{n_h} w_{hi}$, and the
$N_1, \ldots, N_H$ treated as a multinomial distribution of size $N$
parameterized by unknown inclusion stratum probabilities
$q_1, \ldots, q_H$ with, e.g., a Dirichlet prior.) Thus, in the example
of linear regression, where $V(\mu_i) = \sigma^2$ and $g'(\mu_i) = 1$, (7)
resolves to

$$\hat{B} = E(B \mid y, X) = \left[ \sum_h W_h \sum_{i=1}^{n_h} x_{hi} x_{hi}^T \right]^{-1} \left[ \sum_h W_h \left( \sum_{i=1}^{n_h} x_{hi} x_{hi}^T \right) \hat{\beta}_h \right]. \qquad (8)$$

In the example of logistic regression, where $V(\mu_i) =
\mu_i(1 - \mu_i)$ and $g'(\mu_i) = \mu_i^{-1}(1 - \mu_i)^{-1}$, $E(B \mid y, X)$ is given
by solving for the population regression parameters $\hat{B}$ in

$$\sum_{h=1}^{H} W_h \sum_{i=1}^{n_h} x_{hi} \left( \frac{\exp(x_{hi}^T \hat{\beta}_h)}{1 + \exp(x_{hi}^T \hat{\beta}_h)} - \frac{\exp(x_{hi}^T \hat{B})}{1 + \exp(x_{hi}^T \hat{B})} \right) = 0. \qquad (9)$$

This can be accomplished via simple root-finding
numerical methods such as Newton’s Method.
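
A minimal sketch of the Newton iteration for the logistic case follows. It solves $\sum_h W_h \sum_i x_{hi}(\mathrm{expit}(x_{hi}^T \hat{\beta}_h) - \mathrm{expit}(x_{hi}^T B)) = 0$ for $B$, taking the stratum-level posterior means $\hat{\beta}_h$ as given. The function name `solve_population_slope`, the starting value of zero, and the toy inputs are assumptions for illustration; in practice the $\hat{\beta}_h$ would come from the fitted hierarchical model.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def solve_population_slope(X_list, beta_list, W, tol=1e-10, max_iter=50):
    """Newton's method for the weighted logistic estimating equation."""
    p = X_list[0].shape[1]
    B = np.zeros(p)
    for _ in range(max_iter):
        score = np.zeros(p)
        info = np.zeros((p, p))
        for X_h, beta_h, W_h in zip(X_list, beta_list, W):
            y_hat = expit(X_h @ beta_h)          # fitted values under stratum slope
            mu = expit(X_h @ B)                  # fitted values under candidate B
            score += W_h * X_h.T @ (y_hat - mu)
            info += W_h * (X_h.T * (mu * (1 - mu))) @ X_h   # minus the Jacobian of score
        step = np.linalg.solve(info, score)
        B = B + step
        if np.max(np.abs(step)) < tol:
            break
    return B

# Toy inputs: four inclusion strata with an intercept and one covariate.
rng = np.random.default_rng(3)
H = 4
X_list = [np.column_stack([np.ones(30), rng.uniform(-1, 1, 30)]) for _ in range(H)]
W = [1.0, 2.0, 5.0, 10.0]

beta_common = np.array([0.5, -1.0])
B_same = solve_population_slope(X_list, [beta_common] * H, W)

beta_mixed = [np.array([0.5, -1.0 + 0.3 * h]) for h in range(H)]
B_mixed = solve_population_slope(X_list, beta_mixed, W)
```

When every stratum shares the same slope vector, the solution returns that vector; with differing stratum slopes, the solution is a weighted compromise that drives the estimating equation to zero.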
We consider four forms of $\beta^*$ and $G$ in (5) in this
paper:

1. Exchangeable Random Slope (XRS):
$\beta_h^* = (\beta_0^*, \ldots, \beta_{p-1}^*)^T$ for all $h$, $G = I_H \otimes \Sigma$. (10)

2. Autoregressive Random Slope (ARS):
$\beta_h^* = (\beta_0^*, \ldots, \beta_{p-1}^*)^T$ for all $h$,
$G = A \otimes \Sigma$, $A_{jk} = \rho^{|j-k|}$, $j, k = 1, \ldots, H$.

3. Linear Random Slope (LRS):
$\beta_h^* = (\beta_{00}^* + \beta_{01}^* h, \ldots, \beta_{p-1,0}^* + \beta_{p-1,1}^* h)^T$,
$G = I_H \otimes \Lambda$.

4. Nonparametric Random Slope (NPRS):
$\beta_h^* = (f_0(h), \ldots, f_{p-1}(h))^T$, $G = 0$,
$f_j \in \{ f_j : f_j^{(\nu)} \text{ absolutely continuous}, \nu = 0, 1, \int (f_j^{(2)}(u))^2\, du < \infty \}$
minimizing $\sum_h (\beta_{hj} - f_j(h))^2 + \lambda_j \int (f_j^{(2)}(u))^2\, du$,

where $h$ again indexes the probability of inclusion, $I_H$ is
an $H \times H$ identity matrix, $\rho$ is an autocorrelation
parameter that controls the degree of shrinkage across the
weight strata, $\Sigma$ is an unconstrained $p \times p$ covariance
matrix, $\Lambda$ is a $p \times p$ diagonal matrix, and $f_j(h)$ is a
twice differentiable smooth function of $h$ that minimizes
the residual sum of squares plus a roughness penalty
parameterized by $\lambda_j$ (Wahba 1978, Hastie and Tibshirani
1990). Reformulating the NPRS model as in Wang (1998)
we have

$$y_{hi} \mid \beta_h \overset{ind}{\sim} N(x_{hi}^T \beta_h, \sigma^2)$$

$$\beta_{hj} = \beta_{j0}^* + \beta_{j1}^* h + \omega_h u_j$$

$$u_j \overset{ind}{\sim} N_H(0, I \tau_j^2), \quad j = 0, \ldots, p-1,$$

where $\omega_h$ is the $h$th row of the Choleski decomposition of the
cubic spline basis matrix $\Omega$ where $\Omega_{hk} = \int_0^1 ((h-1)/(H-1) - t)_+ ((k-1)/(H-1) - t)_+\, dt$,
$(x)_+ = x$ if $x \ge 0$
and $(x)_+ = 0$ if $x < 0$, $h, k = 1, \ldots, H$. The NPRS model
can be extended into the generalized linear model form as in
Lin and Zhang (1999), where the first-stage normality
assumption is replaced with a link function that is linear in
the covariates: $g(E(y_{hi} \mid \beta_h)) = x_{hi}^T \beta_h$ for $g(\cdot)$ as in (4).
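
The XRS and ARS covariance structures can be assembled with Kronecker products, as in the sketch below. The values of $H$, $p$, $\Sigma$, and $\rho$ are arbitrary illustrations, and the coefficient vector is assumed to be stacked stratum by stratum as in (5).

```python
import numpy as np

H, p = 4, 2                                   # illustrative dimensions
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                # unconstrained p x p covariance
rho = 0.5                                     # autocorrelation across inclusion strata

# XRS: independent, identically distributed stratum slopes.
G_xrs = np.kron(np.eye(H), Sigma)

# ARS: correlation rho^{|j-k|} between strata j and k.
idx = np.arange(H)
A = rho ** np.abs(idx[:, None] - idx[None, :])
G_ars = np.kron(A, Sigma)
```

Block $(j, k)$ of `G_ars` equals $\rho^{|j-k|}\Sigma$, so neighbouring inclusion strata are smoothed toward each other more strongly than distant ones, while `G_xrs` has zero off-diagonal blocks.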

Assuming for the moment that the second stage
parameters are known, we see that, in the case of the XRS
model with normal data, as $|G| \to \infty$, sharing of
information across inclusion strata ceases, and $\hat{\beta}_h \approx
(x_h^T x_h)^{-1} x_h^T y_h$, the regression estimator within the
inclusion stratum. Substituting this into (8) yields $\hat{B} \approx \hat{B}_w$,
the fully weighted estimator of the population slope.
Similarly, as $|G| \to 0$, the within-inclusion-stratum slopes
$\hat{\beta}_h \approx \beta^*$, the common prior slope, yielding $\hat{B} \approx \beta^*$ when
substituted in (8), or $\hat{B}_u$ if a non-informative hyperprior
distribution is placed on $\beta^*$ and its posterior mean obtained
as $(x^T x)^{-1} x^T y$. Empirical or fully Bayesian methods that
allow the data to estimate the second stage parameters thus
allow for data-driven “weight smoothing,” compromising
between the unweighted and fully-weighted estimators.
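
Both limits can be checked numerically for the linear case by plugging stratum-level slopes into (8). In the sketch below the data, the weights `W`, and the helper `combine` are illustrative assumptions; within-stratum least squares stands in for the $|G| \to \infty$ limit and pooled least squares for the $|G| \to 0$ limit.

```python
import numpy as np

rng = np.random.default_rng(4)
n_h = [40, 60, 80]
W = [50.0, 10.0, 2.0]                       # W_h = N_h / n_h, illustrative values

X_list, y_list = [], []
for h, n in enumerate(n_h):
    x = rng.uniform(0, 10, n)
    X_list.append(np.column_stack([np.ones(n), x]))
    y_list.append(1.0 + (0.5 + 0.3 * h) * x + rng.normal(0, 1, n))

def combine(beta_list):
    """Equation (8): weighted combination of stratum-level slopes."""
    lhs = sum(W_h * X_h.T @ X_h for W_h, X_h in zip(W, X_list))
    rhs = sum(W_h * X_h.T @ X_h @ b for W_h, X_h, b in zip(W, X_list, beta_list))
    return np.linalg.solve(lhs, rhs)

# |G| -> infinity: no sharing, each stratum uses its own least-squares fit.
beta_within = [np.linalg.lstsq(X_h, y_h, rcond=None)[0] for X_h, y_h in zip(X_list, y_list)]
B_large_G = combine(beta_within)

# Fully weighted estimator (X'WX)^{-1} X'Wy for comparison.
X = np.vstack(X_list)
y = np.concatenate(y_list)
w = np.concatenate([np.full(n, W_h) for n, W_h in zip(n_h, W)])
B_weighted = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# |G| -> 0 with a flat hyperprior: every stratum uses the pooled least-squares fit.
beta_pooled = np.linalg.lstsq(X, y, rcond=None)[0]
B_small_G = combine([beta_pooled] * len(n_h))
```

The first limit reproduces the fully weighted estimator because the within-stratum normal equations give $x_h^T x_h \hat{\beta}_h = x_h^T y_h$; the second returns the pooled (unweighted) fit unchanged.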

In practice, of course, the second-stage mean and
variance components are usually not known; hence we
complete the model specification by postulating a
hyperprior distribution for the second-stage parameters:

$$p(\phi, \beta^*, G) \propto p(\xi).$$

Typically the hyperprior distribution $p(\xi)$ is either
weakly informative or non-informative. Gibbs sampling
(Gelfand and Smith 1990; Gelman and Rubin 1992) can
then be used to obtain draws from the full joint
posterior of $(\beta, \beta^*, \phi, G)^T \mid y, X$. In the XRS model, we
consider $p(\sigma, \beta^*, \Sigma) \propto \sigma^{-1} |\Sigma|^{-(p+1/2)} \exp(-\tfrac{1}{2} \mathrm{tr}\{r \Sigma^{-1}\})$,
that is, non-informative prior distributions on the scale and
prior mean parameters and an independent inverse-Wishart
hyperprior distribution on the prior variance $G$ centered at
the identity matrix scaled by $r$ with $p$ degrees of freedom.
The same prior distribution is used for the ARS model, with
the additional assumption that $\rho \sim U(0, 1)$ (non-negative
autocorrelation between inclusion strata). In the LRS and
NPRS models, $p(\sigma, \beta^*, \Lambda) \propto \sigma^{-1}$ and $p(\sigma, \beta^*, \tau) \propto \sigma^{-1}$
(standard non-informative scale prior distribution and
hyperprior distribution). A description of the conditional
draws of the Gibbs sampler is available at
http://www.sph.umich.edu/mrelliot/trim/meth2.pdf.

The degree of compromise is a function of the mean and 
variance structure of the chosen model. The XRS and ARS 
models assume exchangeable slope means; the ARS model 
is more flexible in that its variance structure allows units 
with more nearly equal probabilities of inclusion to be 
smoothed more heavily than units with very unequal 
probabilities of inclusion. The LRS model assumes an 
underlying linear trend in slopes, whereas the NPRS model 
assumes only an underlying trend smooth up to its second 
derivative. Note that, in the LRS and NPRS models, we
assume a priori independence for the regression parameters
associated with a given covariate, i.e.,
$(\beta_{1j}, \ldots, \beta_{Hj}) \perp (\beta_{1j'}, \ldots, \beta_{Hj'})$, $j \ne j'$.
This is because we model trends in these
parameters across the inclusion strata, and do not wish to
“link up” these trends across the covariates.

Shrinkage will be greatest, corresponding to the most 
severe weight trimming, when the weight stratum slopes 
have little variability, or when the slopes in the lowest
probability-of-inclusion strata are poorly estimated. Little shrinkage
should occur when weight stratum slopes are precisely 
estimated and when they are systematically associated with 
their probability of inclusion. Based on Elliott and Little 
(2000), we would expect the XRS model to be the most 
efficient when large amounts of weight trimming are 
required to minimize MSE, but to be the most vulnerable to 
“overshrinking” when bias correction is most important. 
Increasing structure, particularly in the mean portion of the 
model as in LRS and NPRS, will provide more robust 
estimation in the sense that overshrinkage will occur only in 
near-pathological situations (e.g., when mean trends are 
non-monotonic and highly discontinuous), and even then
may only lead to slightly less bias correction than the data 
warrant. The price to be paid for this robustness, however, 
will be a reduction in efficiency relative to the exchangeable 
models. 


3. Simulation results 


Because we desire models that are simultaneously more 
efficient than design based estimators yet reasonably robust 
to model misspecification - and in general we feel that even 
Bayesian models should have good frequentist properties - 
we evaluate our proposed models in a repeated sampling 
context. We consider linear and logistic regression, under a 
misspecified model with a non-informative sampling 
design. 


3.1. Linear regression 


For the linear regression model in the presence of model 
misspecification, we generated population data as follows: 


$$Y_i \mid X_i, \sigma^2 \sim N(\alpha X_i + \beta X_i^2, \sigma^2), \qquad (11)$$

$$X_i \sim U(0, 10), \quad i = 1, \ldots, N = 20{,}000.$$

A noninformative, disproportionally stratified sampling
scheme sampled elements as a function of $X_i$ ($I_i$ equals 1 if
sampled and 0 otherwise):

$$h_i = \lceil X_i \rceil$$

$$P(I_i = 1 \mid h_i) = \pi_h \propto (1 + h_i/2.5)\, h_i.$$

This created 10 strata, defined by the integer portions of
the $X_i$ values. Elements $(Y_i, X_i)$ had approximately 1/36th the
selection probability when $0 < X_i < 1$ as when $9 < X_i < 10$. We
sampled $n = 500$ elements without replacement for each
simulation. The object of the analysis is to obtain the
population slope $B_1 = \sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X}) / \sum_{i=1}^{N} (X_i - \bar{X})^2$.
We fixed $\alpha = \beta = 1$, yielding a positive bias in the
estimate of $B_1$, and varied $\sigma^2$. The effect of model
misspecification increases as $\sigma^2 \to 0$ as the bias of the estimators
becomes larger relative to the variance, and conversely
decreases as $\sigma^2 \to \infty$. We considered values of $\sigma^2 = 10^l$,
$l = 1, \ldots, 5$; 200 simulations were generated for each value
of $\sigma^2$.
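
One replicate of this simulation design might be generated as follows. This is a sketch under assumptions: the exact selection mechanism is not spelled out beyond the proportionality rule, so `numpy`'s sequential unequal-probability draw without replacement is used as a stand-in, and the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 20_000, 500
alpha, beta, sigma2 = 1.0, 1.0, 10.0          # sigma2 = 10^l with l = 1 here

# Population from model (11): quadratic mean, so a linear model is misspecified.
X = rng.uniform(0, 10, N)
Y = rng.normal(alpha * X + beta * X ** 2, np.sqrt(sigma2))

# Ten strata from the ceiling of X; selection size measure (1 + h/2.5) * h.
h = np.clip(np.ceil(X).astype(int), 1, 10)
size = (1 + h / 2.5) * h

# Target quantity: population least-squares slope.
B1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)

# Assumed mechanism: sequential draw without replacement, probabilities proportional to size.
sample = rng.choice(N, size=n, replace=False, p=size / size.sum())
x_s, y_s, h_s = X[sample], Y[sample], h[sample]
```

The size measure is 1.4 in the lowest stratum and 50 in the highest, which gives the ratio of about 36 quoted in the text. With $\alpha = \beta = 1$ the population slope is close to 11, since the least-squares slope of $X^2$ on $X$ for a uniform variable on $(0, 10)$ is 10.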

Here and below we utilized an inverse-Wishart
hyperprior distribution on the prior variance $G$, centered at
the identity matrix with 2 degrees of freedom.

In addition to the exchangeable random slope (XRS),
autoregressive random slope (ARS), linear random slope
(LRS), and nonparametric random slope (NPRS) models
discussed in Section 2.3, we consider the standard
design-based (fully weighted) estimator, as well as trimmed weight
and unweighted estimators. For the fully-weighted (FWT)
estimator, we use the PMLE $\hat{B}_w = (X^T W X)^{-1} X^T W y$ where,
denoting by lower case the sampled elements ($I_i = 1$),
$w_{hi} \equiv w_h$ for $h = 1, \ldots, H$, $i = 1, \ldots, n_h$, $W = \mathrm{diag}(w_{hi})$,
$x_{hi} = (1\ x_{hi})^T$, $X_h$ contains the stacked rows of $x_{hi}^T$, and $X$
contains the stacked matrices $X_h$. We obtained inference
about $\hat{B}_w$ via the standard Taylor Series approximation
(Binder 1983):

$$\widehat{\mathrm{var}}(\hat{B}_w) = \hat{S}_{xx}^{-1}\, \hat{\Sigma}(\hat{B}_w)\, \hat{S}_{xx}^{-1},$$

where $\hat{S}_{xx}$ is a design-consistent estimator of the population
total $\sum_{i=1}^{N} x_i x_i^T$ given by $X^T W X$ and $\hat{\Sigma}(\hat{B}_w)$ is a
design-consistent estimate of the variance of the total $\sum_{i=1}^{N} \varepsilon_i x_i$,
where $\varepsilon_i = y_i - x_i^T B$ is the difference between the value of
$y_i$ and its estimated value under the true population slope
$B$: $\hat{\Sigma}(\hat{B}_w) = \sum_h \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} (\tilde{x}_{hi} - \bar{\tilde{x}}_h)(\tilde{x}_{hi} - \bar{\tilde{x}}_h)^T$, where
$\tilde{x}_{hi} = w_{hi} e_{hi} x_{hi}$ for $e_{hi} = y_{hi} - x_{hi}^T \hat{B}_w$. We also consider
the trimmed (TWT) estimator obtained by replacing the
weights $w_{hi}$ with trimmed values $w_{hi}^t$ that set the maximum
normalized value to 3: $w_{hi}^t = N \tilde{w}_{hi} / \sum_h \sum_{i=1}^{n_h} \tilde{w}_{hi}$, where
$\tilde{w}_{hi} = \min(w_{hi}, 3N/n)$, and the unweighted (UNWT)
estimator obtained by fixing $w_{hi} = N/n$ for all $h$, $i$.
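
The design-based estimators above can be sketched as follows. The function names `fwt_estimate` and `trim_weights` and the toy data are hypothetical; the variance uses the stratified linearization form given in the text, and strata with a single sampled unit are skipped, which is an added safeguard rather than something the text specifies.

```python
import numpy as np

def fwt_estimate(X, y, w, strata):
    """Weighted PMLE (X'WX)^{-1} X'Wy with a stratified linearization variance."""
    X, y, w = np.asarray(X, float), np.asarray(y, float), np.asarray(w, float)
    S_xx = X.T @ (w[:, None] * X)
    B_w = np.linalg.solve(S_xx, X.T @ (w * y))
    e = y - X @ B_w
    Z = (w * e)[:, None] * X                      # rows are w_hi * e_hi * x_hi
    middle = np.zeros((X.shape[1], X.shape[1]))
    for h in np.unique(strata):
        Z_h = Z[strata == h]
        n_h = Z_h.shape[0]
        if n_h > 1:                               # assumed safeguard for singleton strata
            D = Z_h - Z_h.mean(axis=0)
            middle += n_h / (n_h - 1) * D.T @ D
    S_inv = np.linalg.inv(S_xx)
    return B_w, S_inv @ middle @ S_inv

def trim_weights(w, N, cap=3.0):
    """Cap weights at cap * N / n, then rescale so they sum to N."""
    w = np.asarray(w, float)
    w_capped = np.minimum(w, cap * N / w.size)
    return N * w_capped / w_capped.sum()

# Toy data: three inclusion strata with very unequal weights.
rng = np.random.default_rng(6)
strata = np.repeat([1, 2, 3], [30, 40, 50])
x = rng.uniform(0, 10, strata.size)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 1.5 * x + rng.normal(0, 1, strata.size)
w = np.where(strata == 1, 200.0, np.where(strata == 2, 20.0, 5.0))
N = w.sum()

B_fwt, V_fwt = fwt_estimate(X, y, w, strata)
B_twt, V_twt = fwt_estimate(X, y, trim_weights(w, N), strata)
B_unwt, V_unwt = fwt_estimate(X, y, np.full(strata.size, N / strata.size), strata)
```

Passing trimmed or constant weights to the same function gives the TWT and UNWT estimators; with constant weights the point estimate coincides with ordinary least squares.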

Table 1 shows the relative bias, root mean square error 
(RMSE), and nominal 95% coverage for the three design- 
based and four model-based estimators of the population 
slope (second component of $B$) under consideration, as a
function of the variance $\sigma^2$.

The fully-weighted estimator of the population slope is
essentially design-unbiased under model misspecification;
the unweighted and trimmed estimators are biased. The
biases of the exchangeable and autoregressive models
increase as variance increases, as these models trade the
unbiasedness of the fully-weighted estimator for the reduced
variance of the unweighted estimator. The linear and
nonparametric models were approximately unbiased.

The unweighted and trimmed weight estimators perform
poorly with respect to MSE for small values of $\sigma^2$, where
the bias due to model misspecification is critical, and well
for larger values of $\sigma^2$, where the instability of the
fully-weighted estimator is more important than bias reduction.
The exchangeable model-based estimator has good RMSE
properties for small and large values of $\sigma^2$, with MSE
reductions of over 30%, but oversmooths for intermediate
degrees of model misspecification. The autoregressive model
performance equals that of the exchangeable model for
small and large values of $\sigma^2$, but is largely protected
against the oversmoothing of the exchangeable models at
intermediate levels. The linear and nonparametric models
essentially dominated the fully weighted estimators with
respect to MSE under all of the simulations considered,
although MSE reductions were only on the order of 10%.

The unweighted and trimmed estimators have poor
coverage except when model misspecification is nearly
absent. The failure of the bias-variance tradeoff for the
exchangeable estimator in the presence of model
misspecification is evident in the poor coverage of the estimator for
intermediate values of $\sigma^2$; this effect is ameliorated, but not
completely removed, for the autoregressive estimator. The
linear and non-parametric estimators have good coverage
when model misspecification is less important but
undercover to some degree when model misspecification is
more important.


Table 1
Relative bias (%), square root of mean square error (RMSE) relative to RMSE of fully-weighted estimator, and true
coverage of the 95% confidence interval or posterior predictive interval of population linear regression slope
estimator under model misspecification. Population slope and intercept are estimated via design-based unweighted
(UNWT), fully-weighted (FWT), and weight-trimmed estimators (TWT), and as the posterior mean in (8) under an
exchangeable (XRS), autoregressive (ARS), linear (LRS), and non-parametric (NPRS) prior for the regression
parameters. MSE relative to the fully-weighted estimator less than 1 in boldface


            Relative bias (%)      RMSE relative to FWT   True Coverage

            Variance (log10)       Variance (log10)       Variance (log10)
Estimator   1   2   3   4   5      1   2   3   4   5      1   2   3   4   5
UNWT De OS 2 Dae 0 Sale 12.1 4.57 1.76 0.75 0.67 0 0 6 » IS.¥5,92 
FWT OL ee O eee IG APOE l 1 1 l l 94 95 96 94. 96 
TWT S250) erect: 4 9ON MES Suu 788 ATAY 1-88" T1029 OF7 P0575 Or ts" 78 954 96 
XRS O22 S14 SAL OSs 1.00 1.17 1.18 0.73 0.68 87 86 64 91: 96 
ARS 0.1 14 96 14.5 17.4 1.007,61.03.. 21911 #,0.745 0.69 STi SO) i Sine 9 OREO 
LRS OSD 0:4 i 1 Ok8-OS 0.99 0.91 0.91 0.91 0.93 852 91 96 Ose O4 
NPRS =ONlST 0:3 ah OLSON ou ie-014 0.89 0.90 0.95 0.90 0.95 86 92 96 94 94 




3.2 Logistic regression 


For the logistic regression model, we generated popu- 
lation data as follows: 


$$Y_i \mid X_i \sim B(\mathrm{expit}(3.25 - 0.75 X_i + \gamma X_i^2)), \qquad (12)$$

$$X_i \sim U(0, 10), \quad i = 1, \ldots, N = 20{,}000,$$

where $B(p)$ is a Bernoulli distribution with probability of
“success” $p$, and $\mathrm{expit}(\cdot) = \exp(\cdot)/(1 + \exp(\cdot))$. The object of
the analysis is to obtain the logistic population regression
slope, defined as the value $B_1$ in the equation
$\sum_{i=1}^{N} (y_i - \mathrm{expit}(B_0 + B_1 x_i)) \binom{1}{x_i} = 0$.
An unequal probability of
selection sampling scheme was implemented as described
in the linear regression simulations. We consider values of
$\gamma = 0, 0.0158, 0.0273, 0.0368, 0.0454$, corresponding to
curvature measures of $K = 0, 0.02, 0.04, 0.06, 0.08$ at the
midpoint 5 of the support for $X$, where
$K(X; \gamma) = |2\gamma / \{1 + (2\gamma X - 0.75)^2\}^{3/2}|$; 200 simulations were
generated for each value of $\gamma$. As in the linear regression
simulations, elements were sampled without replacement
with probability proportional to $(1 + h_i/2.5)h_i$; a total of
1,000 elements were sampled for each simulation. We again
considered the PMLE-based fully weighted (FWT),
unweighted (UNWT), and trimmed weight (TWT) estimators,
along with the exchangeable random slope (XRS),
autoregressive random slope (ARS), linear random slope
(LRS), and nonparametric random slope (NPRS) estimators.
Inference about the PMLE estimators is obtained via Taylor
Series approximations (Binder 1983), as discussed in the
previous section.
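
The correspondence between the values of $\gamma$ and the curvature measures can be checked directly from the formula for $K(X; \gamma)$. The function name `curvature` is a hypothetical helper; the linear coefficient $-0.75$ is taken from model (12).

```python
import numpy as np

def curvature(x, gamma, b1=-0.75):
    """Curvature of the linear predictor 3.25 + b1*x + gamma*x^2 at x."""
    return np.abs(2 * gamma / (1 + (2 * gamma * x + b1) ** 2) ** 1.5)

gammas = np.array([0.0, 0.0158, 0.0273, 0.0368, 0.0454])
K_mid = curvature(5.0, gammas)      # evaluated at the midpoint of the support of X
```

Rounded to two decimals, `K_mid` gives 0, 0.02, 0.04, 0.06, and 0.08, matching the curvature levels used in the simulations.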

Table 2 shows the relative bias, RMSE relative to the
RMSE of the fully-weighted estimator, and true coverage of
the nominal 95% CIs or PPIs associated with each of the
seven estimators of the population slope ($B_1$) for different
values of curvature $K$, corresponding to increased degrees
of misspecification.

The undersampling of small values of $X$ meant that the
maximum likelihood estimator of $B_1$ in the model
misspecification setting was unbiased for $K = 0$ and biased
downward for $K = 0.02, 0.04, 0.06$, and $0.08$ unless the
sample design was accounted for. The trimmed estimator’s
bias was intermediate between the unweighted and fully
weighted estimator. The exchangeable estimator’s bias was
between the trimmed weight estimator and fully weighted
estimator; the autoregressive estimator’s bias between that
of the exchangeable and fully weighted estimator; while the
linear and nonparametric estimators were essentially
unbiased.

The unweighted estimator had substantially improved
MSE (40% reduction) when the linear slope model was
approximately correctly specified, but failed with moderate
to large degrees of misspecification. The trimmed weight,
autoregressive, and nonparametric estimators all dominated
the standard fully-weighted estimator, and the exchangeable
and linear estimators nearly so, over the range of
simulations considered. The crude trimming estimator
yielded up to 30% reduction in MSE, the nonparametric,
exchangeable and autoregressive estimators reductions of up
to 20-25%, and the linear estimator reductions of only 10%
or less.

The unweighted estimator had poor coverage except
when the linear slope model was correctly specified, or
nearly so. The model-based estimators had generally good
coverage properties when the linear model was correctly
specified, with slight reductions in coverage when curvature
was substantial.


Table 2
Relative bias (%), square root of mean square error (RMSE) relative to RMSE of fully-weighted estimator, and
true coverage of the 95% confidence interval or posterior predictive interval of population logistic regression slope
estimator under model misspecification. Population slope and intercept are estimated via design-based unweighted
(UNWT), fully-weighted (FWT), and weight-trimmed estimators (TWT), and as the posterior mean in (8) under
an exchangeable (XRS), autoregressive (ARS), linear (LRS), and non-parametric (NPRS) prior for the regression
parameters. MSE relative to the fully-weighted estimator less than 1 in boldface


            Relative bias (%)            RMSE relative to FWT         True Coverage

            Curvature K                  Curvature K                  Curvature K
Estimator   0  0.02  0.04  0.06  0.08    0  0.02  0.04  0.06  0.08    0  0.02  0.04  0.06  0.08
UNWT 10 -4.9 -11.9 -21.6 -34.6 0.57 0.73 0.88 1.19 1.61 96°°2.389 66 “32h 
FWT Pele 227 13 FP -O5 Yo 1 ] ] ] l 95° 94>" 90 \O949R94 
TWT Oar <1LORTS St Val 2 0.70 0.77 0.77 0.78 0.95 08 -297 094 84782 
XRS 123 )-O.8O21.94 0 -3:.6.02887 0.75 0.82 0.85 0.88 1.02 7 eet. 92. ON eee) 
ARS 13 -0.5 -2.2 -48 -7.5 0.78 0.85 0.84 0.84 0.95 D4. 152 90 92 20 
LRS OS. LFA M.S 2520.4 FXiel 0.89 0.97 0.94 0.91 1.02 I a Pl OO Se ee 
NPRS O38) © 1 SSE Me Oo aa) 0.87 0.88 0.87 0.80 0.90 95% MOQINSS: 94eRS6 
4. Application: Estimation of injuries to
children in compact extended-cab 
pickup trucks 


The Partners for Child Passenger Safety dataset consists 
of the disproportionate, known-probability sample from all 
State Farm claims since December 1998 involving at least 
one child occupant < 15 years of age riding in a model year 
1990 or newer State Farm-insured vehicle (Durbin, Bhatia, 
Holmes, Shaw, Werner, Sorenson and Winston 2001). 
Because injuries, and especially “consequential” injuries 
defined as facial lacerations or other injuries rated 2 or more 
on the Abbreviated Injury Scale (AIS) (Association for the 
Advancement of Automotive Medicine 1990), are relatively 
rare even among children in the population of crash-related 
vehicle damage claims, a disproportional stratified cluster 
sample is used to select vehicles (the unit of sampling) for 
the conduct of a telephone survey with the driver. Vehicles 
containing children who received medical treatment 
following the crash were over-sampled so that the majority 
of injured children would be selected while maintaining the 
representativeness of the overall population. (Medical 
treatment is defined as treatment by paramedics, treatment 
at a physician’s office or emergency room, or hospital- 
ization.) If a vehicle was sampled, all child occupants in that 
vehicle were included in the survey. Drivers of sampled 
vehicles were contacted by phone and, if medical treatment 
had been received by a passenger, screened via an 
abbreviated survey to verify the presence of at least one 
child occupant with an injury. All vehicles with at least one
child who screened positive for injury, and a 10% random
sample of vehicles in which all child occupants reported to
have received medical treatment screened negative for
injury, were selected for a full interview; a 2% (later 2.5%)
sample of crashes in which no medical treatment was
received was also selected. Because the treatment
stratification is imperfectly associated with risk of injury
(more than 15% of the population with consequential
injuries are estimated to be in the lowest
probability-of-selection category and nearly 20% of those without
consequential injuries are in the highest
probability-of-selection category), the sampling design is informative, with
unweighted odds ratios biased toward the null (Korn and
Graubard 1995). In addition, the weights for this dataset are
quite variable: $1 < w_i < 50$, where 9% of the weights have
normalized values greater than 3.
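
To make the weight summary concrete, the following minimal sketch normalizes case weights to mean one, reports the share exceeding a cutoff of 3, and caps the normalized weights at that cutoff. The function name `normalize_and_trim`, the toy weights, and the choice to cap without redistributing the trimmed mass are illustrative assumptions, not the procedure applied to the study data.

```python
import numpy as np

def normalize_and_trim(weights, cutoff=3.0):
    """Normalize weights to mean 1, then cap normalized weights at `cutoff`.

    Capping without redistributing the excess is an assumption of this sketch.
    """
    w = np.asarray(weights, dtype=float)
    w_norm = w / w.mean()                    # normalized weights average to 1
    share_large = np.mean(w_norm > cutoff)   # fraction of weights above the cutoff
    w_trim = np.minimum(w_norm, cutoff)      # crude trimming at the cutoff
    return w_norm, w_trim, share_large

# Hypothetical toy weights spanning 1 to 50 (not the study weights).
toy = np.array([1.0] * 6 + [10.0] * 3 + [50.0])
w_norm, w_trim, share = normalize_and_trim(toy)
```

In this toy example only the single weight of 50 exceeds the cutoff after normalization, so it is the only one altered by the cap.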

Winston, Kallan, Elliott, Menon and Durbin (2002)
determined that children rear-seated in compact extended
cab pickups are at greater risk of consequential injuries than
children rear-seated in other vehicles. However, quantifying
the degree of excess risk, and thus the size of the public health
problem, was problematic. The unweighted odds ratio (OR)
of consequential injury for children riding in compact
extended cab pickups versus other vehicles was 3.54 (95%
CI 2.01, 6.23), versus the fully-weighted estimate of 11.32
(95% CI 2.67, 48.03). Because both injury risk and
compact extended cab pickup use were associated with
child age, crash severity (passenger compartment intrusion
and drivability), direction of impact, and vehicle weight, a
multivariate logistic regression model that adjusted for these
factors was also considered. The unweighted and fully-weighted
adjusted ORs for injury risk in rear-seated children
in compact extended cab pickups versus other vehicles
are 3.50 (95% CI 1.88, 6.53) and 14.56 (95% CI 3.45,
61.40), respectively. Utilizing the unweighted estimator was
problematic because of bias toward the null induced by the
informative sample design; however, the fully-weighted
estimator appeared to be highly unstable, in part because
one child with a consequential injury in the
compact extended cab pickups had a very low probability of
selection (0.025). In Winston et al. (2002), this child was
removed before conducting the analysis.
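
The instability of the fully-weighted estimator can be read off the reported intervals. The sketch below assumes the 95% intervals are Wald intervals that are symmetric on the log odds scale (an assumption, although the reported point estimates agree with the geometric midpoints of the intervals) and backs out the implied standard error of the log odds ratio. The helper names are hypothetical.

```python
import math

def implied_log_se(lower, upper, z=1.96):
    """Standard error of log(OR) implied by a Wald interval (lower, upper)."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)

def geometric_midpoint(lower, upper):
    """Point estimate implied by an interval symmetric on the log scale."""
    return math.sqrt(lower * upper)

# Unadjusted odds ratios and 95% intervals as reported in the text.
reported = {
    "unweighted": (3.54, 2.01, 6.23),
    "fully_weighted": (11.32, 2.67, 48.03),
}

summary = {
    name: {
        "midpoint": geometric_midpoint(lo, hi),
        "log_se": implied_log_se(lo, hi),
    }
    for name, (_, lo, hi) in reported.items()
}

# Ratio of implied standard errors on the log scale.
se_ratio = summary["fully_weighted"]["log_se"] / summary["unweighted"]["log_se"]
```

Under this assumption the fully-weighted log odds ratio has an implied standard error roughly 2.5 times that of the unweighted one, which is the variance cost that weight trimming or smoothing tries to reduce.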

Table 3 shows the results for the unadjusted and adjusted 
odds ratios of consequential injury risk using the un- 
weighted, fully-weighted, and trimmed-weight design-based 
estimators, along with the model-based exchangeable, 
autoregressive, and linear regression slope models. (Results 
for the model-based estimators from 250,000 draws of a 
single chain after a 50,000 draw burn-in; convergence was 
assessed via Geweke (1992).) For the XRS and ARS
models, $p(\Sigma) \sim \text{INVERSE-WISHART}(p, 0.1I)$, where
$p = 2$ for the unadjusted model and $p = 13$ for the
adjusted model. In the unadjusted results, the XRS and ARS
estimators are intermediate between the unweighted and 
fully-weighted estimators, while the linear and nonparametric
estimators tend to track the fully-weighted estimator. In the
adjusted analysis, all three model-based estimators are 
intermediate between the unweighted and fully-weighted 
estimators, with the XRS estimator closest to the un- 
weighted estimator and the LRS estimator closest to the 
fully-weighted estimator. Based on the results of the
simulation, it appears that the ARS estimator, which suggests
relative risks of injury on the order of 7 for children in
compact extended cab pickups relative to other vehicles,
may be a better estimator of relative risk than either the
unweighted or fully-weighted estimator. (As a “sanity
check” of sorts, we note that an additional two years of data,
not available at the time of Winston et al. (2002), which
included an additional 4,091 rear-seated children in
passenger vehicles [44 in compact extended-cab pickup
trucks], provided a fully-weighted unadjusted odds ratio for
injury for children in compact extended-cab pickups of 6.3,
and an adjusted OR of 7.0.)

Table 3

Estimated odds ratio of injury for children rear-seated in compact extended
cab pickups (n = 60) versus rear-seated in other vehicles (n = 8,060), using
unweighted (UNWT), fully-weighted (FWT), weights trimmed to a normalized
value of 3 (TWT), exchangeable random slope (XRS), autoregressive random
slope (ARS), linear random slope (LRS), and nonparametric random slope
(NPRS) estimators; unadjusted and adjusted for child age, crash severity,
direction of impact, and vehicle weight. Point estimates for XRS, ARS, and
LRS models from posterior median. 95% confidence interval or posterior
predictive interval in parentheses. Data from Partners for Child Passenger Safety

          UNWT                   FWT
Unadj.    3.54 (2.01, 6.23)      11.32 (2.67, 48.02)
Adj.      3.50 (1.88, 6.53)      14.56 (3.45, 61.40)

          XRS                    ARS
Unadj.    6.70 (9.51, 20.92)     8.69 (2.64, 21.05)
Adj.      4.45 (2.39, 8.67)      6.67 (3.56, 11.94)


5. Discussion 


The models discussed in this paper generalize the work 
of Lazzeroni and Little (1998) and Elliott and Little (2000), 
where population inference was restricted to population 
means under Gaussian distributional assumptions. Viewing 
weighting as an interaction between inclusion probability 
and model parameters opens up an alternative paradigm for 
weight trimming as a random effects model that smoothes 
model parameters of interest across inclusion classes. 
Models with exchangeable mean structures offer the largest 
degree of shrinkage or trimming but the most sensitivity to 
model misspecification; models with highly structured 
means are potentially less efficient but are more robust to 
model misspecification. This robustness property may be 
particularly important in light of the fact that elements of the 
large inclusion strata provide the largest degree of potential 
variance reduction in the model-based setting but are also 
subject to the largest degree of model bias and variance due 
to extrapolation. 

We consider simulations under varying degrees of model 
misspecification and informative sampling for both linear 
and logistic regression models. The linear and non- 
parametric smoothing models nearly dominated fully- 
weighted estimators with respect to squared error loss in the 
simulations considered. The exchangeable model showed 
some tendency to oversmooth, favoring variance reduction 
over bias correction, especially in the linear regression 
setting. All of the weight smoothing estimators tended to 
have less than nominal coverage when models were highly 
misspecified, although in no case was the nominal coverage 
catastrophically low. The autoregressive smoothing model,
which allows for differential degrees of local smoothing
across weight strata, appeared to provide non-trivial
increases in efficiency with limited risk of severe
oversmoothing or undercoverage.

Table 3 (continued)

          TWT
Unadj.    9.15 (2.65, 31.57)
Adj.      10.99 (2.97, 34.64)

          LRS                    NPRS
Unadj.    11.17 (3.21, 24.94)    10.34 (3.27, 24.62)
Adj.      11.87 (3.33, 36.93)    10.23 (3.02, 37.93)

Applying the methods to the Partners for Child Passenger
Safety data to determine the excess risk of injury in a crash
to rear-seated children in compact extended-cab pickups
relative to rear-seated children in other passenger vehicles, it
appears that the decision in Winston et al. (2002) to
eliminate a low probability-of-selection child from the
analysis to stabilize the estimates was indeed conservative.
Indeed, the ARS estimator, favored by MSE in simulations, 
suggests an adjusted excess risk of 6.7 with a 95% PPI of 
(3.6, 11.9), versus the 14.6 with 95% CI of (3.4, 61.4) of the 
fully-weighted estimator. 

Although this paper utilizes a fully Bayesian approach to 
inference about the posterior predictive distribution of the 
population regression slope, empirical Bayes (EB) estimates 
can also be obtained via ML or REML estimation using 
standard linear or generalized linear mixed model methods. 
In the Gaussian setting, the EB estimates of $G$ and $\sigma^2$ can
be “plugged into” the closed-form expressions for
$E(B \mid y, X)$ and $\mathrm{Var}(B \mid y, X)$. The general exponential
setting is more problematic. The plug-in estimates can be
used to determine $E(B \mid y, X)$ via root-finding methods; the
lack of a closed form for $E(B \mid y, X)$ makes it difficult to
obtain model-based Empirical Bayes estimators for
$\mathrm{Var}(B \mid y, X)$. Also, standard Empirical Bayes estimators
do not account for the uncertainty in the estimation of $G$.

We also note that, while computation of the actual 
trimming values of the case weights is unnecessary in this 
approach, it is possible to determine the revised design 
weights implied by the shrinkage. In the linear model 
setting, these can be obtained via an iterative application of a
calibration weighting scheme such as generalized regression
estimators or GREG (Deville and Särndal 1992). The
general exponential setting requires embedding the
calibration weight algorithm within the iteratively reweighted
least squares (IRWLS) algorithm used to fit a generalized
linear model.

When sampling weights are used to account for misspec- 
ification of the mean in a regression setting, it could be 
argued that the correct approach is to correctly specify the 
mean to eliminate discrepancies between the fully-weighted 
and unweighted estimates of the regression parameters. 
However, perfect specification is an unattainable goal, and 
even good approximations might be highly biased if case 
weights are ignored when the sampling probabilities are 
highly variable. In the informative sampling setting, it may 
be impossible to determine whether discrepancies between 
weighted and unweighted estimates are due to model 
misspecification or to the sample design itself. Finally, even 
misspecified regression models have the attractive feature in 
the finite population setting of yielding a unique target 
population quantity. Consequently, accounting for the
probability of inclusion in linear and generalized linear
model settings continues to be advised, and methods that
balance between a low-bias, high-variance fully-weighted
analysis and a high-bias, low-variance unweighted analysis
remain useful.

The methods discussed in this paper show the promise of 
adapting model-based methods to attack problems in survey 
data analysis. Our goal is not to develop a single hierarchical
Bayesian model finely tuned to a specific dataset or question
at hand, but to develop robust yet efficient methods
that can be applied in the fast-paced “automated” setting in
which many applied survey research analysts must sometimes
work. Although computationally intensive, the methods
considered are applications or extensions of the existing 
random-effect model “toolbox,” and can either be im- 
plemented in existing statistical packages or executed with 
relatively simple MCMC methods. Our approach retains a 
design-based flavor in that we attempt to develop “auto- 
mated” Bayesian model-based estimation techniques that 
yield robust inference in a repeated-sampling setting when 
the model itself is misspecified. However, because these 
models rely on stratifying the data by probability of 
selection as a prelude to using pooling or shrinkage 
techniques to induce data-driven weight trimming, there is a 
natural correspondence between this methodology and 
(post)stratified sample designs in which strata correspond to 
unequal probabilities of inclusion. Developing methods that 
accommodate a more general class of complex sample 
designs that include single or multi-stage cluster samples 
and/or strata that “cross” the inclusion strata remains an 
important area for future work. 

Semiparametric model-assisted estimation for natural resource surveys


F. Jay Breidt, Jean D. Opsomer, Alicia A. Johnson and M. Giovanna Ranalli ¹


Abstract 


Auxiliary information is often used to improve the precision of survey estimators of finite population means and totals 
through ratio or linear regression estimation techniques. Resulting estimators have good theoretical and practical properties, 
including invariance, calibration and design consistency. However, it is not always clear that ratio or linear models are good 
approximations to the true relationship between the auxiliary variables and the variable of interest in the survey, resulting in 
efficiency loss when the model is not appropriate. In this article, we explain how regression estimation can be extended to 
incorporate semiparametric regression models, in both simple and more complicated designs. While maintaining the good 
theoretical and practical properties of the linear models, semiparametric models are better able to capture complicated 
relationships between variables. This often results in substantial gains in efficiency. The applicability of the approach for 
complex designs using multiple types of auxiliary variables will be illustrated by estimating several acidification-related
characteristics for a survey of lakes in the Northeastern US.


Key Words: Regression estimation; Smoothing; Kernel regression; Lake chemistry. 


1. Introduction 


Post-stratification, calibration and regression estimation 
are different design-based approaches that can be used to 
improve the precision of estimators when auxiliary 
information is available at the estimation stage. Model-assisted
estimation (Särndal, Swensson and Wretman 1992)
provides a convenient framework in which to develop these 
and related survey estimators. Under that framework, a 
superpopulation model describes the relationship between 
the variable of interest and the auxiliary variables. This 
model is then used to construct sample-based estimators that 
have improved precision when the model is correct, but 
maintain key design properties such as consistency and an 
estimable variance when the model is incorrect. 

Until recently, the superpopulation models used in this 
context were formulated as parametric models, most often 
ratio or linear models. While reasonable in many practical 
applications, there are also many situations in which such 
relatively simple models are not good representations of the 
relationship between the variable of interest and the 
auxiliary variables. In Breidt and Opsomer (2000), a non- 
parametric model-assisted estimator was proposed based on 
local polynomial regression, which generalized the well- 
established parametric regression estimators. With this 
estimator, the superpopulation is no longer required to 
follow a pre-specified parametric shape. Instead, the
relationship between the variable(s) of interest in the
survey and the auxiliary variable is required to be smooth
(continuous), but is otherwise left completely unspecified.


In the current paper, we formally extend the theory of 
Breidt and Opsomer (2000) to the semiparametric regres- 
sion context, in which some variables are incorporated 
linearly, and others are incorporated through smooth addi- 
tive terms. This extension makes their results more useful in 
practice, since auxiliary information is very often
multi-dimensional in nature, and almost always contains categorical
variables that need to enter the regression model
parametrically (through the use of indicator variables). An
illustration of this is provided by a survey of lakes in the 
Northeastern states of the U.S. conducted by the 
Environmental Monitoring and Assessment Program of the 
US Environmental Protection Agency. In that survey, 334 
lakes were sampled from a population of 21,026 lakes 
between 1991 and 1996. We will apply the semiparametric 
model-assisted estimator to produce estimates of the mean 
and distribution function of the acid neutralizing capacity 
and other chemistry variables of interest. In this application, 
we will include in the model both categorical and contin- 
uous variables linearly and a continuous variable as a 
smooth additive term. 

In Opsomer, Breidt, Moisen and Kauermann (2007), the 
nonparametric model-assisted estimation principle was 
extended to generalized additive models (GAMs) and 
applied in an interaction model for the estimation of 
variables from Forest Inventory and Analysis surveys. 
While GAMs also contained a mixture of categorical 
(parametric) and nonparametric terms, a complete theo- 
retical development is not possible in the case of GAMs, 
and was therefore not provided there. The semiparametric 
model considered in this article can be viewed as a special 
case of a GAM with an identity link function. Unlike the
“general” GAM, the semiparametric model allows for
formal derivation of the statistical properties of the
model-assisted estimator.

The remainder of the article is structured as follows. In 
Section 2, the semiparametric model-assisted estimator is 
defined. Section 3 states and proves the design properties of 
the estimator. Section 4 describes the application of semi- 
parametric model-assisted estimation to the Northeastern 
Lakes data. Section 5 provides a conclusion. 


2. Semiparametric model-assisted estimation 


We begin by considering the superpopulation model with 
a single univariate nonparametric term and a parametric 
component; extension to several nonparametric terms is 
addressed in Section 3.2. The parametric component can be 
composed of an arbitrary number of linear terms. This 
model is the semiparametric model studied by Speckman 
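(1988), among others. This superpopulation model, which
we denote by $\xi$, can be written as

$$
\begin{aligned}
E_\xi(y_k) &= g(x_k, \mathbf{z}_k) = m(x_k) + \mathbf{z}_k\boldsymbol{\beta} \\
\mathrm{Var}_\xi(y_k) &= v(x_k, \mathbf{z}_k)
\end{aligned}
\qquad (1)
$$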


with $x_k$ a continuous auxiliary variable to be modelled
nonparametrically and $\mathbf{z}_k = (z_{1k}, \ldots, z_{Dk})$ a vector of $D$
categorical or continuous auxiliary variables that are
parametrically specified. The functions $m(\cdot)$ and $v(\cdot, \cdot)$
and the parameter vector $\boldsymbol{\beta}$ are unknown. For identifiability
purposes, we will assume that the vector $\mathbf{z}_k$ contains an
intercept term, and that the function $m(\cdot)$ is centered
around 0 with respect to the distribution of the $x_k$. We will
derive the model-assisted estimator that uses model (1) by
first defining population-level estimators for the unknown
functions and parameters, and then constructing
sample-based estimators. This is the same approach used for the
parametric case in Särndal et al. (1992, Chapter 6).

Let $U = \{1, 2, \ldots, N\}$ represent the ordered labels for a
finite population of interest. As the population estimator for
$g(x_k, \mathbf{z}_k)$, we will use the backfitting estimator described
in Opsomer and Ruppert (1999). We first introduce the
required notation. Let $K(\cdot)$ represent a kernel function used
to define the neighborhoods in which the local polynomials
will be fitted (assumptions on $K$ are specified in the
Appendix). The population smoother vector for local
polynomial regression of degree $p$ at $x_k$ is defined as

$$\mathbf{s}_{Uk}' = \mathbf{e}_1'(\mathbf{X}_{Uk}'\mathbf{W}_{Uk}\mathbf{X}_{Uk})^{-1}\mathbf{X}_{Uk}'\mathbf{W}_{Uk}$$

with $\mathbf{e}_1$ a vector of length $p+1$ with a 1 in the first position
and 0s elsewhere, $\mathbf{W}_{Uk} = \mathrm{diag}\{h^{-1}K((x_1 - x_k)/h), \ldots, h^{-1}K((x_N - x_k)/h)\}$ and

$$\mathbf{X}_{Uk} = [1 \;\; x_j - x_k \;\; \cdots \;\; (x_j - x_k)^p]_{j \in U}.$$

The smoother $\mathbf{s}_{Uk}'$ can be applied to the vector $\mathbf{Y}_U =
(y_1, \ldots, y_N)'$ to produce the nonparametric regression fit
with respect to the variable $x$ at observation $x_k$. It can also
be applied to any of the columns of $\mathbf{Z}_U = (\mathbf{z}_1', \ldots, \mathbf{z}_N')'$ to
smooth those with respect to $x$. This will be done in the
derivation of the properties of the semiparametric estimator
(Section 3).
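
As an illustration of the smoother vector just defined, the following sketch computes $\mathbf{s}_{Uk}'$ for a toy population. The Epanechnikov kernel, the bandwidth, the toy data and the function names are assumptions made for the example; the paper's own conditions on $K$ are in its Appendix.

```python
import numpy as np

def epanechnikov(u):
    """Compactly supported kernel; one common choice, assumed here."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def smoother_vector(x, x_k, h, p=1, kernel=epanechnikov):
    """Return s_Uk' = e_1' (X'WX)^{-1} X'W for local polynomial degree p."""
    d = np.asarray(x, dtype=float) - x_k
    X = np.vander(d, N=p + 1, increasing=True)   # columns 1, d, ..., d^p
    w = kernel(d / h) / h                        # diagonal of W_Uk
    XtW = X.T * w                                # X'W without forming diag(w)
    e1 = np.zeros(p + 1)
    e1[0] = 1.0
    return e1 @ np.linalg.solve(XtW @ X, XtW)

# Hypothetical toy population on an equally spaced grid.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
s = smoother_vector(x, x_k=x[10], h=0.2, p=1)
fit_at_xk = s @ y    # nonparametric regression fit at x_k
```

For degree $p \geq 1$ the weights in `s` sum to one and reproduce linear functions exactly, so `s @ x` returns `x[10]`; this is the property used later for calibration on $x$.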

In addition to the smoother vector at $x_k$, $\mathbf{s}_{Uk}'$, we also
need to define the smoother matrix at all the observation
points $x_1, \ldots, x_N$,

$$\mathbf{S}_U = [\mathbf{s}_{Uk}']_{k \in U},$$

and the centered smoother matrix $\mathbf{S}_U^* = (\mathbf{I} - \mathbf{1}\mathbf{1}'/N)\mathbf{S}_U$.
When the smoother matrix is applied to $\mathbf{Y}_U$, it produces the
vector of nonparametric regression fits at all the observation
points. The centered smoother matrix $\mathbf{S}_U^*$ produces
centered fits, i.e., the overall mean of the fitted values is
subtracted from each fitted value. The centering is used to
maintain identifiability of the estimators, as explained in
Opsomer and Ruppert (1999).

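For any observation $x_k$, a possible estimator of $m(x_k)$
could be defined as $\mathbf{s}_{Uk}'\mathbf{Y}_U$, with or without a centering
adjustment. This estimator would generally be poor, since it
does not take into account the fact that the $y_k$ contain a
parametric component that depends on the $\mathbf{z}_k$. A more
efficient estimator is provided by jointly estimating both
$m(\cdot)$ and $\boldsymbol{\beta}$, as is done by the following set of estimators

$$
\begin{aligned}
\mathbf{B} &= (\mathbf{Z}_U'(\mathbf{I} - \mathbf{S}_U^*)\mathbf{Z}_U)^{-1}\mathbf{Z}_U'(\mathbf{I} - \mathbf{S}_U^*)\mathbf{Y}_U \\
m_k &= \mathbf{s}_{Uk}^{*\prime}(\mathbf{Y}_U - \mathbf{Z}_U\mathbf{B}),
\end{aligned}
\qquad (2)
$$

where $\mathbf{s}_{Uk}^{*\prime}$ is the row of $\mathbf{S}_U^*$ corresponding to $x_k$.
In these estimators, $\mathbf{B}$ is calculated first, and then the
“residual vector” $\mathbf{Y}_U - \mathbf{Z}_U\mathbf{B}$ is smoothed with respect to $x$.
The estimators in (2) are identical to the backfitting
estimators for additive models described in Hastie and
Tibshirani (1990) and implemented in gam in S-Plus, R or
SAS. As a population estimator for $E_\xi(y_k) = g(x_k, \mathbf{z}_k)$,
we use

$$g_k = m_k + \mathbf{z}_k\mathbf{B}.$$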
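
The stated equivalence between the closed-form estimators in (2) and backfitting can be checked numerically. The sketch below is an illustration under assumptions: a local linear smoother with an Epanechnikov kernel, a hypothetical toy population, and hypothetical function names. It uses the centered smoother matrix, which is needed for the intercept to be identified.

```python
import numpy as np

def local_linear_matrix(x, h):
    """Smoother matrix S_U: row k is the local linear smoother vector at x_k."""
    n = len(x)
    S = np.empty((n, n))
    for k in range(n):
        d = x - x[k]
        X = np.column_stack([np.ones(n), d])
        u = d / h
        w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0) / h
        XtW = X.T * w
        S[k] = np.linalg.solve(XtW @ X, XtW)[0]   # first row = e_1'(...)
    return S

def closed_form(S_star, Z, y):
    """B and centered fits m as in (2), using the centered smoother matrix."""
    R = np.eye(len(y)) - S_star
    B = np.linalg.solve(Z.T @ R @ Z, Z.T @ R @ y)
    return B, S_star @ (y - Z @ B)

def backfit(S_star, Z, y, tol=1e-12, max_iter=500):
    """Alternate least squares for B and smoothing of the partial residuals."""
    m = np.zeros(len(y))
    for _ in range(max_iter):
        B = np.linalg.lstsq(Z, y - m, rcond=None)[0]
        m_new = S_star @ (y - Z @ B)
        converged = np.max(np.abs(m_new - m)) < tol
        m = m_new
        if converged:
            break
    return B, m

# Hypothetical toy population: smooth effect of x plus a binary covariate.
N = 120
x = np.linspace(0.0, 1.0, N)
z = (np.arange(N) % 2).astype(float)
Z = np.column_stack([np.ones(N), z])              # intercept plus one linear term
y = np.sin(2 * np.pi * x) + 1.5 * z + 2.0

S = local_linear_matrix(x, h=0.15)
S_star = (np.eye(N) - np.ones((N, N)) / N) @ S    # centered smoother matrix
B_cf, m_cf = closed_form(S_star, Z, y)
B_bf, m_bf = backfit(S_star, Z, y)
```

In this example the two routes agree to numerical precision, and the fitted nonparametric component has mean zero because of the centering.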


We now explain how to construct a model-assisted
estimator based on the semiparametric regression approach.
Let $A \subset U$ be a sample of size $n$ drawn from $U$ according
to sampling design $p(A)$ with one-way and two-way
inclusion probabilities $\pi_k = \sum_{A \ni k} p(A)$, $\pi_{kl} = \sum_{A \ni k, l} p(A)$,
respectively. If the $g_k$, $k = 1, \ldots, N$ were available, it
would be possible to construct a difference estimator for the
population mean of the $y_k$, $\bar{y}_N = \sum_U y_k/N$, as

$$\hat{y}_{\text{dif}} = \frac{1}{N}\sum_{k \in U} g_k + \frac{1}{N}\sum_{k \in A}\frac{y_k - g_k}{\pi_k}, \qquad (3)$$

which is design unbiased and has design variance

$$\mathrm{Var}_p(\hat{y}_{\text{dif}}) = \frac{1}{N^2}\sum_{k \in U}\sum_{l \in U}(\pi_{kl} - \pi_k\pi_l)\frac{y_k - g_k}{\pi_k}\frac{y_l - g_l}{\pi_l}$$


(Särndal et al. 1992, page 221). The design variance is small
if the deviations between $y_k$ and $g_k$ are small. This
estimator is not feasible, since it requires knowledge of all
the $x_k$, $\mathbf{z}_k$ and $y_k$ for the population to calculate. Instead,
we will construct a feasible estimator by replacing the $g_k$
by sample-based estimators. The sample-based estimators
corresponding to the population estimators in (2) are
constructed as follows. The design-weighted local
polynomial smoother vector is

$$\mathbf{s}_{Ak}^{0\prime} = \mathbf{e}_1'(\mathbf{X}_{Ak}'\mathbf{W}_{Ak}\mathbf{X}_{Ak})^{-1}\mathbf{X}_{Ak}'\mathbf{W}_{Ak} \qquad (4)$$

with $\mathbf{X}_{Ak}$ containing the rows of $\mathbf{X}_{Uk}$ corresponding to the
sampled elements $j \in A$ and

$$\mathbf{W}_{Ak} = \mathrm{diag}\left\{\frac{1}{\pi_j h}K\left(\frac{x_j - x_k}{h}\right)\right\}_{j \in A}.$$


The matrix $\mathbf{X}_{Ak}'\mathbf{W}_{Ak}\mathbf{X}_{Ak}$ in (4) will be singular if, for
some sample $A$, there are fewer than $p+1$ observations in
the support of the kernel at some $x_k$. This issue can be
avoided in practice by selecting a bandwidth large enough to
make that matrix invertible. However, this situation cannot
be excluded in general and we need an estimator that exists
for every sample $A$ for the theoretical derivations of
Section 3. Hence, we will consider the following adjusted
sample smoother vector

$$\mathbf{s}_{Ak}' = \mathbf{e}_1'\left(\mathbf{X}_{Ak}'\mathbf{W}_{Ak}\mathbf{X}_{Ak} + \mathrm{diag}\{\delta N^{-2}\}_{j=1}^{p+1}\right)^{-1}\mathbf{X}_{Ak}'\mathbf{W}_{Ak} \qquad (5)$$


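for some small $\delta > 0$, as done in Breidt and Opsomer
(2000). The sample smoother matrix and its centered
version are

$$\mathbf{S}_A = [\mathbf{s}_{Ak}']_{k \in A}, \qquad \mathbf{S}_A^* = (\mathbf{I} - \mathbf{1}\mathbf{1}'\boldsymbol{\Pi}_A^{-1}/N)\mathbf{S}_A,$$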


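with $\boldsymbol{\Pi}_A = \mathrm{diag}\{\pi_k : k \in A\}$. The design-weighted
estimators for $\boldsymbol{\beta}$ and the $m_k$ are

$$\hat{\mathbf{B}} = (\mathbf{Z}_A'\boldsymbol{\Pi}_A^{-1}(\mathbf{I} - \mathbf{S}_A^*)\mathbf{Z}_A)^{-1}\mathbf{Z}_A'\boldsymbol{\Pi}_A^{-1}(\mathbf{I} - \mathbf{S}_A^*)\mathbf{Y}_A \qquad (6)$$

$$\hat{m}_k = \mathbf{s}_{Ak}^{*\prime}(\mathbf{Y}_A - \mathbf{Z}_A\hat{\mathbf{B}}), \qquad (7)$$

where $\mathbf{Z}_A$ and $\mathbf{Y}_A$ denote the sample versions of $\mathbf{Z}_U$ and
$\mathbf{Y}_U$, respectively. Note that the estimator $\hat{m}_k$ is defined for
any $x_k$ in the population, not only those appearing in the
sample. As for the population estimators, these estimators
can again be written as the solution to backfitting equations,
so that they can be calculated by appropriately weighted
versions of the existing algorithms. The estimator for $g_k$ is
$\hat{g}_k = \hat{m}_k + \mathbf{z}_k\hat{\mathbf{B}}$.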

The semiparametric model-assisted estimator is then
constructed by replacing the $g_k$ in (3) by the $\hat{g}_k$:

$$\hat{y}_{\text{reg}} = \frac{1}{N}\sum_{k \in U}\hat{g}_k + \frac{1}{N}\sum_{k \in A}\frac{y_k - \hat{g}_k}{\pi_k}. \qquad (8)$$

Defining $\bar{y}_\pi = \sum_{k \in A} y_k/(N\pi_k)$ and similarly for $\bar{\mathbf{z}}_\pi$, an
equivalent expression for $\hat{y}_{\text{reg}}$ is given by

$$\hat{y}_{\text{reg}} = \bar{y}_\pi + (\bar{\mathbf{z}}_N - \bar{\mathbf{z}}_\pi)\hat{\mathbf{B}} + \frac{1}{N}\left(\sum_{k \in U}\hat{m}_k - \sum_{k \in A}\frac{\hat{m}_k}{\pi_k}\right), \qquad (9)$$

which shows that the semiparametric estimator can be
interpreted as a “traditional” linear regression survey
estimator using the parametric model component $\mathbf{z}_k\boldsymbol{\beta}$, with
an additional correction term for the nonparametric
component of the model. This estimator also shares some
desirable properties with the fully parametric regression
estimators. It is location and scale invariant, and it is
calibrated for both the parametric and the nonparametric
model components, in the sense that $\hat{x}_{\text{reg}} = \bar{x}_N$ and
$\hat{\mathbf{z}}_{\text{reg}} = \bar{\mathbf{z}}_N$. The calibration for the variables in the parametric
term can be checked directly by using expressions (6) and
(7), while the calibration for the nonparametrically specified
variable $x_k$ follows from the fact that $\mathbf{s}_{Ak}'\mathbf{x}_A = x_k$, where
$\mathbf{x}_A = (x_k : k \in A)'$ (we are ignoring the effect of the
adjustment $\mathrm{diag}(\delta N^{-2})$ in (5), because that adjustment can
be made arbitrarily small). In addition, the estimator can be
written as a weighted sum of the $y_k$, $k \in A$, so that a set of
weights $w_k$ can be obtained and applied to any survey
variable of interest.
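
The calibration property can be illustrated numerically. The sketch below is a simplified version of estimator (8) under stated assumptions: local linear smoothing with an Epanechnikov kernel, no $\delta$ adjustment as in (5), an equal-probability systematic sample from a hypothetical toy population, and centering by the design-weighted mean of the fits (which coincides with division by $N$ here because $\sum_A 1/\pi_k = N$). All function and variable names are hypothetical.

```python
import numpy as np

def kernel(u):
    """Epanechnikov kernel (an assumed choice)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def smoother_rows(x_eval, x_s, pi_s, h):
    """Design-weighted local linear smoother vectors, one row per point in x_eval."""
    rows = np.empty((len(x_eval), len(x_s)))
    for i, x0 in enumerate(x_eval):
        d = x_s - x0
        X = np.column_stack([np.ones(len(x_s)), d])
        w = kernel(d / h) / (pi_s * h)            # diagonal of W_Ak
        XtW = X.T * w
        rows[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return rows

def model_assisted_mean(y_s, x_s, Z_s, pi_s, x_U, Z_U, h):
    """Simplified estimator (8); returns the estimate and B_hat."""
    N = len(x_U)
    d = 1.0 / pi_s                                # design weights
    S_A = smoother_rows(x_s, x_s, pi_s, h)
    shift = (d @ S_A) / d.sum()                   # weighted mean of the sample fits
    S_A_star = S_A - shift                        # centered sample smoother matrix
    R = np.eye(len(y_s)) - S_A_star
    ZtW = Z_s.T * d
    B = np.linalg.solve(ZtW @ R @ Z_s, ZtW @ R @ y_s)       # as in (6)
    resid = y_s - Z_s @ B
    g_U = (smoother_rows(x_U, x_s, pi_s, h) - shift) @ resid + Z_U @ B
    g_s = S_A_star @ resid + Z_s @ B
    return g_U.sum() / N + (d @ (y_s - g_s)) / N, B

# Hypothetical toy population and an equal-probability systematic sample.
N = 200
x_U = np.linspace(0.0, 1.0, N)
z_U = (np.arange(N) % 2).astype(float)
Z_U = np.column_stack([np.ones(N), z_U])          # intercept plus one linear term
y_U = np.sin(2 * np.pi * x_U) + 1.5 * z_U + 2.0
A = np.arange(2, N, 5)                            # every fifth unit, n = 40
pi_s = np.full(len(A), len(A) / N)

est, B_hat = model_assisted_mean(y_U[A], x_U[A], Z_U[A], pi_s, x_U, Z_U, h=0.3)
cal_x, _ = model_assisted_mean(x_U[A], x_U[A], Z_U[A], pi_s, x_U, Z_U, h=0.3)
cal_z, _ = model_assisted_mean(z_U[A], x_U[A], Z_U[A], pi_s, x_U, Z_U, h=0.3)
```

Applying the same routine with the auxiliary variables in place of $y$ returns their known population means (`cal_x` equals `x_U.mean()` and `cal_z` equals `z_U.mean()` up to rounding), which is what calibration means in this context.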


3. Properties and extensions 
3.1 Design properties 


In this section, we explore the design properties of the
semiparametric estimator (8). In particular, we prove that
$\hat{y}_{\text{reg}}$ is design $\sqrt{n}$-consistent, and we derive its asymptotic
distribution, including an estimated variance. This will be
done in the design-asymptotic context used in Isaki and
Fuller (1982) and in Breidt and Opsomer (2000), in which
both the population and the samples increase in size as
$N \to \infty$. All proofs and the necessary assumptions are in
the Appendix.


Statistics Canada, Catalogue No. 12-001-XPB 




In the following theorem, we prove the design
consistency of the semiparametric estimator. We also show
that the convergence rate is $\sqrt{n}$, the usual rate for design
estimators.


Theorem 3.1 Under the assumptions A1-A8, the estimator
$\hat{y}_{\text{reg}}$ in (8) is design consistent with rate $\sqrt{n}$, in the
sense that

$$\hat{y}_{\text{reg}} - \bar{y}_N = O_p\left(\frac{1}{\sqrt{n}}\right).$$

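The following theorem proves that a central limit
theorem for $\hat{y}_{\text{reg}}$ exists whenever it exists for the expansion
estimator $\bar{y}_\pi$.


Theorem 3.2 Under the assumptions A1-A8, if

$$\frac{\bar{y}_\pi - \bar{y}_N}{\hat{V}^{1/2}(\bar{y}_\pi)} \to N(0, 1),$$

with

$$\hat{V}(\bar{y}_\pi) = \frac{1}{N^2}\sum_{k \in A}\sum_{l \in A}\frac{\pi_{kl} - \pi_k\pi_l}{\pi_{kl}}\frac{y_k}{\pi_k}\frac{y_l}{\pi_l}$$

for a given sampling design, then we also have

$$\frac{\hat{y}_{\text{reg}} - \bar{y}_N}{\hat{V}^{1/2}(\hat{y}_{\text{reg}})} \to N(0, 1),$$

with

$$\hat{V}(\hat{y}_{\text{reg}}) = \frac{1}{N^2}\sum_{k \in A}\sum_{l \in A}\frac{\pi_{kl} - \pi_k\pi_l}{\pi_{kl}}\frac{y_k - \hat{g}_k}{\pi_k}\frac{y_l - \hat{g}_l}{\pi_l}. \qquad (10)$$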

3.2 Semiparametric additive model 


The results in Theorems 3.1 and 3.2 use the semi- 
parametric model (1), which contains a single univariate 
nonparametric term m(-). In many practical applications, 
several auxiliary variables will be available that could be 
included in the nonparametric portion of a model, but the
curse of dimensionality often makes it difficult to combine
several variables into a single multi-dimensional
nonparametric term. Instead, the variables that are to be
included nonparametrically will be treated as univariate 
components. This results in the semiparametric additive 
model, which is written as 


$$
\begin{aligned}
E_\xi(y_k) &= g(\mathbf{x}_k, \mathbf{z}_k) = m_1(x_{1k}) + \cdots + m_Q(x_{Qk}) + \mathbf{z}_k\boldsymbol{\beta} \\
\mathrm{Var}_\xi(y_k) &= v(\mathbf{x}_k, \mathbf{z}_k)
\end{aligned}
$$

where the $m_q(\cdot)$, $q = 1, \ldots, Q$ and $v(\cdot, \cdot)$ are unknown
smooth functions.

When $Q = 2$, expressions similar to (6) and (7) can be
developed, using the additive model decompositions of
Opsomer and Ruppert (1997), and for $Q > 2$, recursive
expressions can be derived using the approach of Opsomer
(2000). The estimator would then be written as in equations
(6) and (7), but with the smoother vectors $\mathbf{s}_{Ak}$ and smoother
matrix $\mathbf{S}_A$ replaced by complicated higher-dimensional
additive model smoothers (see Opsomer (2000) for details).
Because of this, formally proving the properties of the
model-assisted estimator for the case with arbitrary $Q$
would be a challenging task beyond the scope of the current
article.

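In practice, the backfitting algorithm formulation
provides a much more efficient and simple way to calculate
the semiparametric estimator. Let $\mathbf{s}_{Aqk}$ represent the sample
smoother vector, as defined in (5), for the variable $x_q$ at the
observation $x_{qk}$, and let $\mathbf{S}_{Aq}$ be the corresponding smoother
matrix for the variable $x_q$, with $\mathbf{s}_{Aqk}^*$ and $\mathbf{S}_{Aq}^*$ their centered
versions. Also, $\hat{m}_{qk}$ denotes the sample-weighted
backfitting estimator for $m_q(x_{qk})$ and $\hat{\mathbf{m}}_q =
(\hat{m}_{qk}, k \in A)'$. The backfitting algorithm for a model
including $Q$ nonparametric terms consists of the following
set of equations, iterated to convergence:

$$\hat{\mathbf{B}} = (\mathbf{Z}_A'\boldsymbol{\Pi}_A^{-1}\mathbf{Z}_A)^{-1}\mathbf{Z}_A'\boldsymbol{\Pi}_A^{-1}\left[\mathbf{Y}_A - \sum_{q=1}^{Q}\hat{\mathbf{m}}_q\right]$$

$$\hat{\mathbf{m}}_q = \mathbf{S}_{Aq}^*\left[\mathbf{Y}_A - \mathbf{Z}_A\hat{\mathbf{B}} - \sum_{q' \neq q}\hat{\mathbf{m}}_{q'}\right], \quad q = 1, \ldots, Q.$$

These equations provide weighted fits at the sample
locations $k \in A$ only. For the remaining locations $k \in U$
not in $A$, an additional smoothing step is required after
obtaining the $\hat{\mathbf{m}}_q$, $q = 1, \ldots, Q$:

$$\hat{m}_{qk} = \mathbf{s}_{Aqk}^{*\prime}\left[\mathbf{Y}_A - \mathbf{Z}_A\hat{\mathbf{B}} - \sum_{q' \neq q}\hat{\mathbf{m}}_{q'}\right].$$

The sample-based estimators for the mean function at all
$k \in U$ are then defined as $\hat{g}_k = \hat{m}_{1k} + \cdots + \hat{m}_{Qk} + \mathbf{z}_k\hat{\mathbf{B}}$,
which are used in expression (8) to construct the
model-assisted estimator.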


4. Application to Northeastern Lakes survey 


In this section, we will show the applicability of the
semiparametric regression estimator on a dataset of water
chemistry samples. As will be illustrated, once a set of
auxiliary variables and a model have been selected,
computing survey estimators for the semiparametric model
is as easy as for linear models, and hence can lead to
improved precision for relatively little cost.

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The National Surface Water Survey (NSWS) sponsored 
by the U.S. Environmental Protection Agency (EPA) 
between the years of 1984 and 1986 estimated 4.2 percent 
of the lakes in the northeastern region of the United States to 
be acidic (Stoddard, Kahl, Deviney, DeWalle, Driscoll, 
Herlihy, Kellogg, Murdoch, Webb and Webster 2003). 
Acid-sensitive Northeastern lakes were among the concerns 
addressed by the Clean Air Act Amendment (CAAA) of 
1990, which placed restrictions on industrial sulfur and 
nitrogen emissions in an effort to reduce the acidity of these 
waters. A common measurement of acidity is acid
neutralizing capacity (ANC), which is defined as a water's
ability to buffer acid. An ANC value less than zero μeq/L
indicates that the water has lost all ability to buffer acid.
Surface waters with ANC values below 200 μeq/L are
considered at risk of acidification, and values less than 50
μeq/L are considered at high risk (National Acid
Precipitation Assessment Program (1991), page 15).

Between 1991 and 1996, the Environmental Monitoring 
and Assessment Program (EMAP) of the U.S. Environ- 
mental Protection Agency conducted a survey of lakes in 
the Northeastern states of the U.S. These data were collected 
in order to determine the effect that restrictions put in place 
by the CAAA had on the ecological condition of these 
waters. The survey is based on a population of 21,026 lakes 
from which 334 lakes were surveyed, some of which were 
visited several times during the study period. Multiple 
measurements on the same lake were averaged in order to 
obtain one measurement per lake sampled. Lakes to be 
included in the survey were selected using a complex 
sampling design commonly employed by EMAP based on a 
hexagonal grid frame (see Larsen, Thornton, Urquhart and 
Paulsen (1993) for a description of the sampling design). 

Let $y_k$ represent the (possibly averaged) ANC value of
the $k$th sampled lake. A very simple estimate of the ANC
mean of the lakes is represented by the expansion estimator
$\bar y_\pi$. In this as in many surveys, a better choice is the Hajek
estimator,

$$\hat y_H = \frac{1}{\hat N} \sum_{k \in A} \frac{y_k}{\pi_k}, \qquad (11)$$

which applies a ratio type adjustment for the estimation of
the population size through $\hat N = \sum_{k \in A} 1/\pi_k$. However,
auxiliary variables are available for each lake in this 
population, so that it should be possible to further improve 
upon the efficiency of the Hajek estimator. The following 
variables are available for each k € U: 


$x_k$ = UTMX, x-geographical coordinate of the
centroid of each lake in the UTM coordinate
system,

$z_{jk}$ = indicator variable for eco-region $j = 1, \ldots, 6$,
$z_{7k}$ = UTMY, y-geographical coordinate,
$z_{8k}$ = elevation.
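
The contrast between the expansion estimator and the Hajek estimator (11) can be sketched in a few lines of Python. This is only an illustration: the function names and the toy inclusion probabilities are assumptions made here, not part of the EMAP analysis.

```python
def expansion_mean(y, pi, N):
    """Expansion estimator of the mean: weighted total divided by the known N."""
    return sum(yk / pk for yk, pk in zip(y, pi)) / N


def hajek_mean(y, pi):
    """Hajek estimator: the same weighted total divided by the estimated size N_hat."""
    n_hat = sum(1.0 / pk for pk in pi)               # N_hat = sum over A of 1/pi_k
    total_hat = sum(yk / pk for yk, pk in zip(y, pi))
    return total_hat / n_hat


# Toy sample (hypothetical values): three lakes with unequal inclusion probabilities.
y = [10.0, 20.0, 30.0]
pi = [0.5, 0.25, 0.25]
print(hajek_mean(y, pi))          # weighted total 220 divided by N_hat 10
print(expansion_mean(y, pi, 12))  # same total divided by an assumed known N = 12
```

One property worth noting is that the Hajek estimator reproduces a constant exactly (if every $y_k$ equals $c$, the estimate is $c$), which the expansion estimator does only when $\hat N = N$.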


There are seven different eco-regions included in the
population; thus dummy variables $z_{jk}$ are constructed for
$j = 1, \ldots, 6$. A semiparametric regression estimator for the
variable $y$ will be constructed by treating the UTMX
variable $x$ as a nonparametric term and the remaining
variables $z_1$ to $z_8$ as a parametric component. Model
selection was used to determine that treating the other two
continuous variables as nonparametric did not improve the
model fit. For comparison purposes, we also computed a
regression estimator that treats all terms as parametric. This
estimator is therefore identical to the semiparametric
estimator, except that the x-geographical coordinate is
modeled linearly. We will denote this fully parametric
regression estimator by $\hat y_{par}$.

In order to determine the estimated efficiency of survey
estimators, we need to compute the variance estimates.
However, second order inclusion probabilities were not
available, thus we cannot evaluate $\hat V(\hat y_{reg})$ as in (10). In
order to come up with appropriate variance estimates, we
treat the complex sampling design as a stratified sample
taken with replacement. The 14 strata we selected
correspond to groups of spatial clusters of lakes that
appeared in the original design, and that were used to ensure
spatial distribution of the sampled lakes over the region of
interest. Larsen et al. (1993) provide details on the
construction of the spatial clusters.

Let $H$ be the number of strata, $n_h$ the number of
observations within stratum $h$, and $A_h$ the set of sampled
elements that fall in stratum $h$. Define $p_k = n_h^{-1} \pi_k$ for $k \in A_h$. Using
this notation and the assumption of a stratified sample with
replacement, we rewrite the semiparametric estimator as

$$\hat y_{reg} = \frac{1}{N} \sum_{k \in U} \hat g(x_k, z_k) + \frac{1}{N} \sum_{h=1}^{H} \frac{1}{n_h} \sum_{k \in A_h} \frac{y_k - \hat g(x_k, z_k)}{p_k}, \qquad (12)$$

and the variance estimator as

$$\hat V(\hat y_{reg}) = \frac{1}{N^2} \sum_{h=1}^{H} S_h^2, \qquad (13)$$

where $S_h^2$ is the estimated within-stratum weighted residual
variance for stratum $h$. Assuming the strata are sampled
with replacement, Särndal et al. (1992, pages 421-422)
suggest $S_h^2$ can be calculated as

$$S_h^2 = \frac{1}{n_h (n_h - 1)} \sum_{k \in A_h} \left( \frac{y_k - \hat g(x_k, z_k)}{p_k} - \frac{1}{n_h} \sum_{l \in A_h} \frac{y_l - \hat g(x_l, z_l)}{p_l} \right)^2.$$
Similarly, we estimate $V(\hat y_H)$ through

$$\hat V(\hat y_H) = \frac{1}{\hat N^2} \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{k \in A_h} \left( \frac{y_k - \hat y_H}{\pi_k} - \frac{1}{n_h} \sum_{l \in A_h} \frac{y_l - \hat y_H}{\pi_l} \right)^2, \qquad (14)$$

and the expression for $\hat V(\hat y_{par})$ is obtained completely
analogously as for $\hat V(\hat y_{reg})$, except that $\hat g(x_k, z_k)$ is
computed by linear regression.
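
A minimal Python sketch of a stratified with-replacement variance estimator of this kind is given below. It uses the standard textbook form written in terms of the $\pi_k$ (an assumption made here about how the expressions above are applied); the function name and the toy inputs are hypothetical.

```python
from collections import defaultdict


def stratified_wr_variance(residuals, pi, strata, N):
    """Variance estimate for a mean under a stratified with-replacement approximation.

    residuals : e_k, e.g. y_k - g_hat(x_k, z_k), or y_k - y_hat for the Hajek case
    pi        : first-order inclusion probabilities pi_k
    strata    : stratum label of each sampled unit
    N         : population size (or its estimate N_hat for the Hajek case)
    """
    groups = defaultdict(list)
    for e, p, h in zip(residuals, pi, strata):
        groups[h].append(e / p)                  # weighted residual e_k / pi_k

    total = 0.0
    for u in groups.values():
        n_h = len(u)
        if n_h < 2:
            raise ValueError("each stratum needs at least two sampled units")
        mean_h = sum(u) / n_h
        # n_h / (n_h - 1) times the sum of squared deviations within the stratum
        total += n_h / (n_h - 1) * sum((v - mean_h) ** 2 for v in u)
    return total / N ** 2


# Toy check: one stratum, two units, weighted residuals 2 and 6.
print(stratified_wr_variance([1.0, 3.0], [0.5, 0.5], ["a", "a"], N=4))
```

Because only first-order inclusion probabilities and stratum labels enter, the same routine can be reused for each estimator by changing the residuals that are passed in.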

This setup allows us to obtain the following estimates of 
mean ANC for the Northeastern lakes, together with 
variance estimates and approximate 95% confidence 
intervals (CI). A local linear fit has been employed for the 
nonparametric term with bandwidth set at one tenth of the 
range of UTMX. 
$\hat y_{reg}$ = 558.0 μeq/L    $\hat V(\hat y_{reg})$ = 2534.6    CI = (459.3; 656.6)

$\hat y_{par}$ = 577.3 μeq/L    $\hat V(\hat y_{par})$ = 3239.6    CI = (465.8; 688.9)

$\hat y_H$ = 555.9 μeq/L    $\hat V(\hat y_H)$ = 4313.3    CI = (427.2; 684.7)

The confidence interval constructed using the Hajek 
estimator is about 31% wider than that constructed using the 
semiparametric estimator, while the interval for the fully 
parametric regression estimator is 13% wider. These results 
show evidence of an improvement in efficiency provided by 
accounting for the auxiliary information in both a 
parametric and nonparametric way in the mean estimation 
procedure, with the nonparametric estimator able to capture 
some additional efficiency beyond that of the parametric 
estimator. 
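
The intervals and the relative widths quoted above can be checked with a short Python sketch. It assumes normal-approximation intervals with the multiplier 1.96, and takes the Hajek point estimate as 555.9 μeq/L, the midpoint of its reported interval; the names used are illustrative only.

```python
import math

# (point estimate, variance estimate) in ueq/L, as reported in the text.
REPORTED = {
    "semiparametric": (558.0, 2534.6),
    "parametric": (577.3, 3239.6),
    "Hajek": (555.9, 4313.3),  # midpoint of the reported interval (427.2; 684.7)
}


def normal_ci(estimate, variance, z=1.96):
    """Approximate 95% confidence interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


def width_ratio(variance, reference_variance):
    """Ratio of interval widths, which equals the ratio of standard errors."""
    return math.sqrt(variance / reference_variance)


for name, (est, var) in REPORTED.items():
    low, high = normal_ci(est, var)
    ratio = width_ratio(var, REPORTED["semiparametric"][1])
    print(f"{name:15s} CI = ({low:.1f}; {high:.1f})  width ratio = {ratio:.3f}")
```

The width ratios printed for the Hajek and parametric estimators can be compared directly with the percentages quoted in the text, up to rounding of the published figures.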

As mentioned above, an important goal of this 
application is the assessment of how many lakes are at risk 
of acidification or are acidified already. That is, we are 
interested in estimating the proportion of Northeastern lakes 
with ANC values smaller than some specific threshold 
values. We can determine such proportions by estimating 
the finite population distribution function, 


$$F_N(t) = \frac{1}{N} \sum_{k \in U} I_{\{y_k \le t\}}$$

at specific threshold values $t$, where $I_{\{y_k \le t\}}$ denotes the
indicator function taking a value of 1 if $y_k \le t$ and 0
otherwise. Because all three estimators can be expressed as
weighted sums of sample observations, the weights obtained
for each can be applied directly to the $I_{\{y_k \le t\}}$ for the sample
to estimate $F_N(t)$ for any desired $t$. Let us denote by
$\hat F_H(t)$, $\hat F_{reg}(t)$ and $\hat F_{par}(t)$ the Hajek, semiparametric and
parametric regression estimators of the distribution function,
respectively. Estimates for their design variances are
computed by plugging the indicator variables in equations
(13) and (14).

Figure 1 shows estimates of the ANC cdf produced by
$\hat F_H(t)$, $\hat F_{par}(t)$ and $\hat F_{reg}(t)$ evaluated on a grid of 1,000
equally spaced values for $t$. Included are their respective
pointwise 95% confidence intervals calculated at each grid
point. All three estimators are similar, but the confidence
bands for the parametric and semiparametric regression
estimators tend to be narrower. Averaged over all 1,000
grid points, the widths of the confidence bands are 0.093
for $\hat F_H(t)$, 0.084 for $\hat F_{par}(t)$ and 0.075 for $\hat F_{reg}(t)$,
respectively.


Figure 1 Estimates of the population cumulative distribution function for
ANC and confidence bounds produced by Hajek, parametric
and semiparametric regression estimators (horizontal axis: ANC)


Along with ANC, the EMAP survey of Northeastern
lakes measured the concentration of multiple chemistry
variables including sulfate, magnesium and chloride, so that
the survey weights obtained for ANC can also be applied to
these concentrations as well as their respective cdfs. As
another illustration of the semiparametric estimation
approach, it is possible to "invert" $\hat F_{reg}(t)$ to obtain quantile
estimators $\hat\theta_{reg}(\alpha) = \min\{t : \hat F_{reg}(t) \ge \alpha\}$ of these additional
chemistry variables. Table 1 displays semiparametric
estimates of the first, second, and third quartiles of sulfate,
magnesium, and chloride measured in μeq/L. Variance
estimation for these quantiles could be handled using
asymptotic results of Francisco and Fuller (1991), but will
not be explored further here.
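
The way a single set of survey weights yields both a distribution function estimate and its inverse can be sketched as follows. The function names and toy data are hypothetical, and the sketch assumes nonnegative weights so that the estimated cdf is nondecreasing (regression weights are not guaranteed to be nonnegative in general).

```python
def weighted_cdf(y, w, t):
    """Weighted estimate of F(t): share of total weight on observations with y_k <= t."""
    total = sum(w)
    return sum(wk for yk, wk in zip(y, w) if yk <= t) / total


def weighted_quantile(y, w, alpha):
    """Smallest t such that weighted_cdf(y, w, t) >= alpha.

    With nonnegative weights the minimum is attained at a sample value, so it is
    enough to scan the observations in increasing order.
    """
    total = sum(w)
    cumulative = 0.0
    for yk, wk in sorted(zip(y, w)):
        cumulative += wk
        if cumulative / total >= alpha:
            return yk
    return max(y)


# Toy example: three observations, the middle value carrying half of the weight.
y = [3.0, 1.0, 2.0]
w = [1.0, 1.0, 2.0]
print(weighted_cdf(y, w, 2.0))       # weight 3 out of 4 lies at or below 2
print(weighted_quantile(y, w, 0.5))  # first value whose cumulative share reaches 0.5
```

Since the weights do not depend on the study variable, the same `w` could be applied unchanged to sulfate, magnesium or chloride values.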


Table 1 Quartile estimates of chemistry variables (μeq/L)
α     Sulfate  Magnesium  Chloride


0.25 Toes 63.8 27.4 
0.50 104.3 127.0 162.2 
0.75 201.4 221.9 462.2
5. Conclusion


In this article, we have described a model-assisted 
estimator that uses semiparametric regression to capture 
relationships between multiple population-level auxiliary 
variables and the survey variables. We have developed 
asymptotic theory that shows the resulting estimator is 
design consistent and asymptotically normal under mild 
conditions on the design and the population. This 
generalizes the results of Breidt and Opsomer (2000), who 
had proved similar results for a univariate nonparametric 
model-assisted estimator. The semiparametric estimator was 
applied to data from a survey of lakes in the Northeastern 
U.S., where it was shown to be more efficient than an 
estimator that does not take advantage of the auxiliary 
variables and than a fully parametric regression estimator. 

In addition to its theoretical properties, the semi- 
parametric model-assisted estimator has attractive practical 
properties as well. As noted earlier, it is fully calibrated for 
the auxiliary variables, whether used in the parametric or 
nonparametric model components, and it is location and 
scale invariant. The estimator can be expressed as a 
weighted sum of the sample observations, so that it 
conforms to the traditional survey estimation paradigm and 
a single set of weights can be applied to all the survey 
variables, hence preserving relationships between variables. 

One issue which was not addressed in the current article 
is the selection of the smoothing parameter for the 
nonparametric component of the regression model. This is a 
challenging topic in the model-assisted context, further 
complicated by the just mentioned fact that a single set of 
survey regression weights is applied to all the survey 
variables: because the optimal bandwidth choice depends on 
the variable being smoothed, no single bandwidth (and 
hence set of weights) will be optimal for all variables in the 
survey. This topic is currently being explored by the 
authors. 


Acknowledgments 


The research for this article was supported by National 
Science Foundation grants DMS-0204531 and DMS- 
0204642, and by STAR Research Assistance Agreements 
CR-829095 and CR-829096 awarded by the U.S.
Environmental Protection Agency (EPA) to Colorado State 
University and Oregon State University. This manuscript 
has not been formally reviewed by EPA. The views 
expressed here are solely those of the authors. EPA does not 
endorse any products or commercial services mentioned in 
this report. 




Appendix 
Technical assumptions and derivations 


We begin by stating the necessary assumptions, which 
extend those used in Breidt and Opsomer (2000) to the 
semiparametric model. 


Assumptions: 


A1 Distribution of the errors under $\xi$: the errors $\varepsilon_k$
are independent and have mean zero, variance
$v(x_k, z_k)$, and compact support, uniformly for all $N$.
A2 Distribution of the covariates: the $x_k$ and $z_k$ are
considered fixed with respect to the superpopulation
model $\xi$. The $z_k$ are assumed to have bounded
support, and the $x_k$ are independent and identically
distributed $F(x) = \int_{-\infty}^{x} f(t)\,dt$, where $f(\cdot)$ is a density
with compact support $[a_x, b_x]$ and $f(x) > 0$ for all
$x \in [a_x, b_x]$.
A3 Nonparametric mean and variance functions: the
mean function $m(\cdot)$ is continuous, and the variance
function $v(\cdot, \cdot)$ is bounded and strictly greater than 0.
A4 Kernel $K$: the kernel $K(\cdot)$ has compact support
$[-1, 1]$, is symmetric and continuous, and satisfies
$\int_{-1}^{1} K(u)\,du = 1$.
A5 Sampling rate $nN^{-1}$ and bandwidth $h_N$: as
$N \to \infty$, $nN^{-1} \to \pi \in (0, 1)$, $h_N \to 0$ and
$N h_N^2 / (\log \log N) \to \infty$.
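A6 Inclusion probabilities $\pi_k$ and $\pi_{kl}$: for all $N$,
$\min_{k \in U_N} \pi_k \ge \lambda > 0$, $\min_{k, l \in U_N} \pi_{kl} \ge \lambda^* > 0$ and

$$\limsup_{N \to \infty} \; n \max_{k, l \in U_N : k \ne l} |\pi_{kl} - \pi_k \pi_l| < \infty.$$

A7 Additional assumptions involving higher-order
inclusion probabilities:

$$\lim_{N \to \infty} n^2 \max_{(k_1, k_2, k_3, k_4) \in D_{4,N}} \left| E_p\left[ (I_{k_1} - \pi_{k_1})(I_{k_2} - \pi_{k_2})(I_{k_3} - \pi_{k_3})(I_{k_4} - \pi_{k_4}) \right] \right| < \infty,$$

where $D_{t,N}$ denotes the set of all distinct $t$-tuples
$(k_1, k_2, \ldots, k_t)$ from $U_N$;

$$\lim_{N \to \infty} \max_{(k_1, k_2, k_3, k_4) \in D_{4,N}} \left| E_p\left[ (I_{k_1} I_{k_2} - \pi_{k_1 k_2})(I_{k_3} I_{k_4} - \pi_{k_3 k_4}) \right] \right| = 0,$$

and

$$\limsup_{N \to \infty} \; n \max_{(k_1, k_2, k_3) \in D_{3,N}} \left| E_p\left[ (I_{k_1} - \pi_{k_1})^2 (I_{k_2} - \pi_{k_2})(I_{k_3} - \pi_{k_3}) \right] \right| < \infty.$$

A8 The matrix $N^{-1} Z_U^T (I - S_U) Z_U$ is invertible for all
$N$ with model probability 1.
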
Assumption A8 is required so that the population
estimator $B$ is well-defined. The invertibility of the matrix
in A8 depends on the combined effect of the bandwidth $h$
and the joint distribution of the $x_k$ and $z_k$. While it would
in principle be possible to write down sufficient conditions
for this, we opted for this simpler and more explicit
approach.

Before giving the proofs of Theorems 3.1 and 3.2, we 
state and prove a number of lemmas. 


Lemma 1 Under the assumptions A1-A7,
(a) for all $k \in U$ and $d = 1, \ldots, D$,


1 1 
os E, (Six ¥, Siz Fy) i [=] 
N ‘G 


nh 


1 
VA A ase o(=): 
nh 


(b) the s{,¥,, and si,Z,, are uniformly bounded over all 
kev. 


Proof of Lemma 1: Since both the y, and z, are 
bounded by assumption, part (a) can be shown using an 
identical reasoning as in Lemma 4 of Breidt and Opsomer 
(2000). While that lemma did not include a rate of 
convergence, that rate is readily derived by noting that 


and 


| T 
Ni 2 E, (Sax 


i iho 3 -o[— | 

N ikeUy nh 

in the notation of Breidt and Opsomer (2000) and then 
proceeding as in that proof. 

Part (b) was proven directly in Lemma 2 (iv) of Breidt and 
Opsomer (2000). 


Lemma 2 Under assumptions A1-A8,

$$\hat B = B + O_p(1/\sqrt{nh}),$$

with the rate holding component-wise, and $B$ is bounded
for all $N$.


Proof of Lemma 2: Write jh! =s/, ¥, and jl) = 
S,,¥, for the population and sample smoothed versions of 
y,; ‘and: similarly, ©20¢ a's! 0Z,s'and zis), Z,. We 
rewrite expression (6) as a function of sample-weighted 
ferns cal 


where 
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r Bess ZZ has z, a ale pi zital 
Nar ae hey cae he Nee 
toon ta) ish iee Ni 

t a ] 

ae. Ne Tr, N 

; =i 2 Ve Fs Ep +2 Ly ye 
EE PON, ps ora 


The sample-weighted estimator B will be expanded around 
and the remaining $t_l$ can be found in (15). The existence
and continuity of the derivatives of $\hat B$ with respect to the $\hat t_l$
and evaluated at $t_l$ follow from Lemma 1(b) and the
existence of the inverse in (15), which is assumed by A8.

The result will follow from a 0th order Taylor expansion
if we can show that $\hat t_l - t_l = O_p(1/\sqrt{nh})$ for all $l$ (e.g.,
Fuller (1996), Corollary 5.1.5). For /, and #,, this follows 
directly from A2 and A6. The remaining terms contain sums 
involving smoothed quantities 2°) and jl! We 
demonstrate the reasoning for one of those terms in /,. We 
have 

1 A angel 


iad 2) Vion T [sy] _ I T =[sy] Jey 
ape iio Z Vk me ox Vk E 


where 


A Ty U k 


eo z) (pte i 
k 


and the first term is $O_p(1/\sqrt{n})$ by A6 and Lemma 1(b),
using the same argument as in Lemma 4 of Breidt and
Opsomer (2000). For the second term, use Schwarz's
inequality
where z!’) denotes that the squares are computed
component-wise. The first term is bounded by A2 and A6,
and the second term is $O_p(1/\sqrt{nh})$ by Lemma 1(a) and
Markov's inequality. The desired result then follows by
applying the same reasoning to the remaining terms in
Pais bch; 

The boundedness of B follows directly from assumption 
A8, Lemma 1(b) and the boundedness of the z,. 


Lemma 3 Under the assumptions A1-A8, we have

$$\hat y_{reg} = \hat y_{dif} + o_p(1/\sqrt{n}).$$

Proof of Lemma 3: Given expression (9), we need to
show that

(Zy —z,)(B- B)= 0) (—- (16) 


ml ty hpi eo base eee 
Dae Aa i. °,(+e} (17) 


Lemma 2 and assumptions A2, A5 and A6 show that
oie) LB = B)= O,,(1/nh). In order to prove (17), we 
can rewrite it as 


I 
a (m, — m, ) (1-4) -4 2 (Ja -5ps[1- 


T 
The first term on the right hand side has been proven to be
$o_p(1/\sqrt{n})$ in Lemma 5 of Breidt and Opsomer (2000); this
same lemma and the boundedness of $B$ provide the same rate
for the second term. Assumptions A5-A6, Lemma 1(b) and
Lemma 2 show that the third term is $O_p(1/(n\sqrt{h}))$ and the
desired rate is achieved.


Lemma 4 Under assumptions A6 and A8, 
Ey (Vaig )= Yn 


5 l LA iar 5 ied bad Sk 
Var, (Var) => a Fa ea 
Proof of Lemma 4: The properties of the difference
estimator are readily computed. The rate of the design 
variance follows from the stated assumptions using the same 
reasoning as in Lemma 4 of Breidt and Opsomer (2000). 


Lemma 5 Under assumptions A1-A8,

Proof of Lemma 5: The reasoning for this proof will
closely follow that of Theorem 3 of Breidt and Opsomer 
(2000). We write 


V (Seow) 7 Var, (Suir) F (45) cs Aire )) 
+ V (Sac) — Var, (Sir) (18) 
with 
by assumptions A1-A3 and from Lemmas 1(b) and 2, the
approach used for the term A, of Breidt and Opsomer 
(2000) can be used to show that 


al y 1 
E ,|V Deir) — Var, Dar) = Oh arts 
n 
which provides the desired consistency by the Markov
inequality.
For the first term in (18), note that 


oper, ay, a ee) a) 
can be decomposed into variance terms involving sample
and population smooths and parameter estimators. Each of
these terms can be shown to be $o_p(1/n)$. We demonstrate
the approach on one of the terms:

2, — 2 My — 1,7 - 
&x 2) — SF Muy k ‘7, 1,(B-B) 
where $C_1, C_2 < \infty$ summarize the bounded terms (by
assumptions A1-A3 and A6 and Lemma 1(b)), and the rate
of convergence is the result of assumption A6 and Lemmas
1(a) and 2.

Proof of Theorem 3.1: In Lemma 3, we show that

$$\hat y_{reg} = \hat y_{dif} + o_p(1/\sqrt{n}),$$

where $\hat y_{dif}$ is the difference estimator (3). The result
immediately follows from assumption A5 and Lemma 4.

Proof of Theorem 3.2: Note that $\hat y_{dif}$ can be written as the
sum of a population constant and an expansion estimator of
the form $\hat y_\pi$ by defining a new variable $\tilde y_k = y_k - s_k^T y_U +
s_k^T Z_U B - z_k B$ for $k \in U$. As is the case for the original
$y_k$, this new variable has bounded support by Lemma 1(b)
and a variance of order $O(1/n)$ by Lemma 4. Hence,
existence of the CLT for $\hat y_\pi$ implies existence of the CLT
for $\hat y_{dif}$. Also, $\hat y_{reg} = \hat y_{dif} + o_p(1/\sqrt{n})$ by Lemma 3, so that
$\sqrt{n}\,\hat y_{reg}$ and $\sqrt{n}\,\hat y_{dif}$ have the same asymptotic distribution.
Applying Slutsky's Theorem and Lemma 5 completes the
proof.
Ex post weighting of price data to estimate depreciation rates


Marc Tanguay and Pierre Lavallée


Abstract 


To model economic depreciation, a database is used that contains information on assets discarded by companies. The 
acquisition and resale prices are known along with the length of use of these assets. However, the assets for which prices are 
known are only those that were involved in a transaction. While an asset depreciates on a continuous basis during its service 
life, the value of the asset is only known when there has been a transaction. This article proposes an ex post weighting to
offset the effect of this source of error in building econometric models.


Key Words: Price ratio; Survival data; Uniform distribution; Depreciation of vehicles. 


1. Introduction 


Various econometric models are used to estimate 
economic depreciation. To this end, we use a database 
containing information on assets discarded by companies. 
The acquisition and resale prices are known along with the 
length of use of these assets. From this information, we 
would like to infer results for the total population of assets 
used by companies. Regarding the use of the prices of used
assets to estimate economic depreciation, we refer the reader
to Gellatly, Tanguay and Yan (2002) and Hulten and
Wykoff (1981).

We question, however, the representativeness of the 
database used. Indeed, the assets for which prices are known 
are solely those subject to a transaction. We do not know the 
extent to which the losses of value observed on these assets 
are representative of the loss of value for all assets in 
production, regardless of whether they were the subject of a 
transaction. This situation can be a source of error in 
building econometric models because these models seek to 
measure depreciation of assets over their service lives, 
regardless of whether there was a transaction. 

It is this source of error that we propose to offset,
at least in part, by applying ex post weighting when building 
econometric models. Section 2 of this article will describe 
the problem in greater detail, while in Section 3, we will 
describe the approach used to determine the weights. 
Finally, in Section 4, we present some numeric results. 


2. Problem 


We are seeking to describe the relationship between
prices and asset age. There is a sample of $n$ assets where we
know, for each asset $i$, the price ratio $r_i$ and the time $t_i$
when this ratio was measured. Once prices are expressed in
real dollars, this ratio is given as $r_i = P^t / P^0$ where $P^0$ is
the initial value of the investment in asset $i$ and $P^t$ is its
resale price at time $t$. This ratio is strictly decreasing in
relation to the time axis $t$. At the start, we do not know the
process that generates the loss in value and there are no
specifics about the function that describes this loss except
that it is strictly decreasing. However, it is possible to
examine the distribution of the price ratios between 0 and 1.
Figure 1 gives an example constructed from data on manufacturing
plants (note that 2/3 of the sample was excluded because it
corresponds to discarded assets, for which the price is zero; the
estimation procedures take this component into account,
each in its own way).

Since we want to use the data to infer statistics on the
population of assets in production, we would like our data to
have properties similar to those of a random sample drawn
from that population. As we stated earlier, this is not the
case because we only have the prices of assets $i$ that were
subject to a transaction at time $t_i$, $i = 1, \ldots, n$. In effect,
while we would like to have price ratios for various periods
in the existence of a given asset $i$, the ratio is only available
when there has been a transaction, something that occurs in
a non-uniform manner over an asset's service life.

Consequently, we can ask ourselves what form the above
distribution might have if it had been drawn from a sample
in which the price ratio had been measured, for the same
asset $i$, at different times $t$. Our argument is that it should
converge toward a uniform distribution. We will therefore
seek to obtain a weighting that will help us recreate a
uniform distribution of price ratios. This weighting will help
us offset the lack of uniformity in the distribution of
observations, which may impact statistical analyses such as
linear regression.
Figure 1 Distribution of observations by price ratio, manufacturing plants


3. Approach 


Our starting point is that price ratios can be considered 
empirical realizations of an unknown form of survival 
function. In service life models, the survival function 
expresses the probability that an entity with a limited service 
life will survive beyond a certain point on the time axis. 
Accordingly, it provides the same information as a 
distribution function (or Cumulative Distribution Function). 
We will let ,, be a random variable describing the service 
life of a unit of value incorporated in some asset. The value 
gradually erodes over time for as long as the asset is in 
service. The price ratio can therefore be interpreted as the 
surviving fraction that gradually becomes smaller and 
smaller. This fraction is written as S(y) and gives 


S(y)=1-F(y) 


where $F(y) = P(r < y)$ is the distribution function, that is,
the probability that a unit of value is lost before point $y$.

Fundamental transformation theorems of probability
laws provide the means for defining the inverse function of
$F(y)$ (Greene 1993 and Ross 2002). We let $z = F(y)$ and
assume that the inverse function $F^{-1}$ exists so that
$y = F^{-1}(z)$. This shows that there is a direct match between
the space of $y$, bounded at 0 but infinite to the right, and that
of $F$, which is bounded between 0 and 1. The distribution
function of $z$ is $F(F^{-1}(z)) = z$. The law that generates this
distribution is a uniform distribution between 0 and 1.
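
The transformation result can be checked numerically with a small simulation. The sketch below is an illustration only: it assumes an exponential service-life law with an arbitrary rate of 0.3, which is not taken from the article, and verifies that $z = F(y)$ spreads evenly over equal-width classes of $[0, 1]$.

```python
import math
import random


def exp_cdf(y, rate):
    """Distribution function of an exponential law (assumed here for illustration)."""
    return 1.0 - math.exp(-rate * y)


def share_in_bins(values, H):
    """Share of values falling in each of H equal-width classes of [0, 1]."""
    counts = [0] * H
    for v in values:
        counts[min(int(v * H), H - 1)] += 1
    return [c / len(values) for c in counts]


rng = random.Random(0)                         # fixed seed so the run is reproducible
draws = [rng.expovariate(0.3) for _ in range(20000)]
z = [exp_cdf(y, 0.3) for y in draws]           # z = F(y)
shares = share_in_bins(z, 5)
print(shares)                                  # each share should be close to 1/5
```

The same check could be run with any continuous, strictly increasing $F$; only the function `exp_cdf` would change.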
This result is generally at the core of data generation
processes like Monte Carlo simulations because the uniform
distribution is often used when a random sample is being
generated, followed by the application of the inverse
function (Davidson and MacKinnon 1993). This approach is
not always practical and indeed is sometimes patently
impossible, especially if the inverse function $F^{-1}$ is not
explicit. This result has also been used in generalized
residual approaches, notably to build specification tests
(Lancaster 1985).

The result is that any random sample built using 
empirical realizations of survival proportion data must 
converge in distribution toward a uniform distribution. 

In the case of price data, intuition suggests that between 
the time of investment and that of disposal, the full range of 
relative prices must be covered by an asset in production. 
Initially, value depreciates faster and therefore there are 
more observations with short periods of time. This is offset 
by the fact that the corresponding reference on the time 
scale is also shorter. For example, it takes less time to move 
from 100% of the initial value to 90%, than from 15% to 
5% of the initial value. 
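
As a worked illustration of this example, suppose (purely as an assumption for this sketch) that value declines geometrically at a constant annual rate $\tau$, so that $r(t) = (1 - \tau)^t$. The time needed to move between two price ratios then depends only on their quotient:

```latex
t(r_a \to r_b) = \frac{\ln(r_b / r_a)}{\ln(1 - \tau)}, \qquad
\frac{t(0.15 \to 0.05)}{t(1.00 \to 0.90)} = \frac{\ln(1/3)}{\ln(0.9)} \approx 10.4 .
```

Under this assumed form, the step from 15% to 5% of the initial value takes roughly ten times longer than the step from 100% to 90%, whatever the value of $\tau$.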

It is easy to verify these findings numerically using 
simulated data and we will not spend time on this. Rather, 
we will examine how this result can be reintroduced in the 
database to produce, at least partially, properties similar to 
those of a random sample. We can do this by simply 
imposing ex post on the empirical price distribution a 
weight structure $w_i$ that ensures that the empirical
distribution of the data, in the price space, is uniform.

The empirical distribution of price ratios $r$ is given by

$$\hat F(y) = \frac{\sum_{i=1}^{n} I_i(y)}{n} \qquad (1)$$

where $I_i(y) = 1$ if the measured value $r_i$ of asset $i$ is less
than or equal to $y$ (specifically, $r_i \le y$), and 0 otherwise,
and $n$ is the total number of observations. Note that if the $n$
units of the sample are independent and identically
distributed (i.i.d.), when $n \to \infty$, $\hat F(y)$ converges in
probability to $F(y)$, that is, $\hat F(y) \xrightarrow{p} F(y)$ (Bickel and
Doksum 1977).

To obtain weight $w_i$ for each asset $i$, we simply
distribute the sample in a given number $H$ of intervals (or
classes) of a fixed size on the scale of price ratios, and we
assign the same probability $\pi = 1/H$ to each of these
intervals. Since the price ratios are bounded by 0 and 1, we
then have the interval $h = 1$ given by $[0, H^{-1}]$, and for
$h = 2, \ldots, H$, the intervals are given by $((h-1)H^{-1},
hH^{-1}]$. A weight $w_i$ is then calculated in each interval $h$ by
the ratio $\pi / \hat\pi_h$, where $\hat\pi_h$ is the empirical probability
specific to interval $h$, producing

$$\hat\pi_h = \frac{1}{n} \sum_{i=1}^{n} \delta_i(h) = \frac{n_h}{n} \qquad (2)$$

where $\delta_i(h) = 1$ if $r_i \in h$, 0 otherwise, and $n_h$ is the number
of observations in interval $h$. We then propose

$$w_i = \frac{\pi}{\hat\pi_h} = \frac{n}{H n_h} \qquad (3)$$

for $r_i \in h$. Using these weights, the weighted empirical
distribution of the price ratios $r$ is given by
distribution of the price ratios r is given by 
By writing $\sum_{i=1}^{n} w_i = \sum_{h=1}^{H} n_h \, n / (H n_h) = n$, we finally get

When $n \to \infty$, we have $(1/n) \sum_{i=1}^{n} \delta_i(h) I_i(y) \xrightarrow{p}
P(r \in h, r \le y)$ and $(1/n) \sum_{i=1}^{n} \delta_i(h) \xrightarrow{p} P(r \in h)$. Thus,
when $n \to \infty$,

$$\hat F_w(y|h) \xrightarrow{p} \frac{P(r \in h, r \le y)}{P(r \in h)} = P(r \le y \mid r \in h) = F(y|h) \qquad (7)$$

where $F(y|h)$ is the distribution of price ratios $r$ within
interval $h$.

For a sufficiently large $n$, $H$ must be determined in such a
way as to build the intervals $h$ so that $\hat F_w(y|h)$ is
distributed approximately uniformly, $h = 1, \ldots, H$. In other
words, when $n \to \infty$, for a sufficiently large $H$, $F(y|h)$
should have a uniform distribution on interval $h$. Note that
this argument was used by Dalenius and Hodges (1959) in a
context of optimal stratification. In this case, the distribution
$F(y|h)$ is given by

$$F(y|h) = \begin{cases} 0 & \text{for } y < (h-1)H^{-1} \\ Hy - h + 1 & \text{for } (h-1)H^{-1} \le y \le hH^{-1} \\ 1 & \text{for } y > hH^{-1}. \end{cases} \qquad (8)$$

Since $F(y) = \sum_{h=1}^{H} F(y|h)/H$, we have $F(y) = y$,
which corresponds to the uniform distribution. We conclude
from this that for a sufficiently large $n$, the use of weighting
(3) should ensure that the weighted empirical distribution
$\hat F_w(y)$ given by (5) is distributed approximately
uniformly.
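
The weighting (3) can be sketched in Python as follows. The function names and the toy price ratios are hypothetical, and the sketch assumes every interval contains at least one observation; it checks that the weights sum to $n$ and that each interval carries the same weighted share $1/H$.

```python
import math
from collections import Counter


def interval_index(r, H):
    """Interval of ratio r: h = 1 is [0, 1/H]; h >= 2 is ((h-1)/H, h/H]."""
    return min(H, max(1, math.ceil(r * H)))


def ex_post_weights(ratios, H):
    """Weights w_i = n / (H * n_h), where n_h counts the observations in the interval of r_i."""
    n = len(ratios)
    idx = [interval_index(r, H) for r in ratios]
    counts = Counter(idx)
    return [n / (H * counts[h]) for h in idx]


def weighted_share(ratios, weights, H):
    """Weighted share of the sample falling in each of the H intervals."""
    n = len(ratios)
    shares = [0.0] * H
    for r, w in zip(ratios, weights):
        shares[interval_index(r, H) - 1] += w / n
    return shares


# Toy sample (hypothetical): low price ratios are over-represented.
ratios = [0.05, 0.10, 0.15, 0.30, 0.35, 0.45, 0.55, 0.70, 0.90, 0.95]
weights = ex_post_weights(ratios, H=5)
print(sum(weights))                        # equals n when no interval is empty
print(weighted_share(ratios, weights, 5))  # each interval carries a share of 1/H
```

Observations in crowded intervals receive weights below one and observations in sparse intervals receive weights above one, which is what flattens the weighted histogram.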

Monte Carlo simulations have shown that estimates 
produced from a non-random sample could be improved by 
using this approach. Its main advantages can be attributed 
to: 

- its simplicity;

- the fact that it can be introduced ex ante, or prior to
introducing the econometric model as such.
Consequently, it does not require strong working
hypotheses.


If we go back to the histogram presented earlier and
divide the sample in $H = 5$ intervals of a width of 0.2 and a
value of $\pi = 1/5 = 0.2$, we then get the following histogram
that was weighted ex post.


4. Application 


We will now illustrate our approach using an example
taken from the Kelley Blue Book, a source of information
widely used to estimate depreciation of automobiles. Table
1 shows the prices of two models of cars at different ages
between 1 and 18 years. For each car, we have a sample of
$n = 18$ units. Prices are expressed in relative value in
relation to a new model. The ratios also have to be adjusted
to take into account the survival probability at each of these
ages. For each vehicle, the final ratio used $r_i$ for year $i$ is
built from the product of the price ratio times the survival
probability.

We are interested in the average depreciation rate $\tau$ for
each car. This can be estimated from a regression of the
prices (or from a function of these prices) in relation to age
(or a function of age). However, if we assume that the rate is
constant and geometric, we obtain the relationship $\tau =
1 - r_i^{1/i}$, where $r_i$ is the relative price based on age $i$. In this
case, a rate $\tau_i$ can be estimated at each age $i$ by $\hat\tau_i =
1 - r_i^{1/i}$. An estimate of the average rate of depreciation is
then produced from the average for all ages, $\bar\tau = \sum_{i=1}^{18} \hat\tau_i / 18$.
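
The age-specific rates and their simple average can be computed as in the sketch below. It uses the Buick relative prices excluding disposals as listed in Table 1 (the ninth value is read as 0.2227, an assumption about a digit lost in the listing); whether the article's averages are based on this column or on the survival-adjusted ratios is not settled here, so the sketch is illustrative only.

```python
# Relative prices of the Buick at ages 1 to 18, excluding disposals (Table 1).
BUICK_EXCL = [
    0.8633, 0.7435, 0.6410, 0.5523, 0.4740, 0.4034,
    0.3391, 0.2790, 0.2227, 0.1639, 0.1261, 0.0892,
    0.0614, 0.0441, 0.0320, 0.0190, 0.0088, 0.0051,
]


def geometric_rates(ratios):
    """Rate implied at each age i under constant geometric depreciation: 1 - r_i**(1/i)."""
    return [1.0 - r ** (1.0 / age) for age, r in enumerate(ratios, start=1)]


rates = geometric_rates(BUICK_EXCL)
average_rate = sum(rates) / len(rates)   # unweighted average over the 18 ages

for age, rate in enumerate(rates, start=1):
    print(f"age {age:2d}: rate = {rate:.4f}")
print(f"simple average = {average_rate:.4f}")
```

At age 1 the implied rate is simply one minus the relative price, which matches the first entry of the depreciation-rate columns in Table 1.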

In the above example, we see that the depreciation rates
$\hat\tau_i$ vary by age range and that they tend to increase with
age. Moreover, the fact that we use a simple average of the
ages in calculating $\bar\tau$ again implicitly gives the same
weight to each age. However, it is quite clear that this is not
the distribution that we would get from a random sample of
service vehicles. The figure below shows the distribution of
price cells between ratios of 0 and 1.


The reweighting technique simply involves applying an 
equal weight to each of the relative price ranges. In this 
example, the $n = 18$ ages are distributed into $H = 7$ 
classes, resulting in 18/7 of the ages in each class (in reality, 
the cells were configured into 8 classes, but 
the last is always empty). As mentioned in Section 3, the 
individual weights $w_i$ for each age $i$ are built using (3), that 
is, by dividing 18/7 by the number of observations found in 
each class, except for the empty cells where the weight 
remains zero. Table 2 shows the results and the impact of 
reweighting on the derived statistics. 
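The reweighting step can be sketched as follows. The sketch assumes that the 8 cells are of equal width (0.125) on the interval from 0 to 1; the text does not state the cell boundaries, so this is an assumption, and the function names are hypothetical. The inputs are the Buick columns (including disposals) of Table 2.

```python
from collections import Counter


def ex_post_weights(ratios, n_cells=8):
    """Weight each observation by (n / H) / (count in its cell), H = non-empty cells."""
    width = 1.0 / n_cells  # assumed equal-width cells on [0, 1]
    # Cell index of each ratio; a ratio of exactly 1 is placed in the last cell.
    cells = [min(int(r / width), n_cells - 1) for r in ratios]
    counts = Counter(cells)
    share = len(ratios) / len(counts)  # here 18 / 7 ages per non-empty class
    return [share / counts[c] for c in cells]


def weighted_average(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Buick relative prices and depreciation rates at ages 1 to 18 (Table 2).
buick_ratio = [0.8622, 0.7361, 0.6195, 0.5092, 0.4042, 0.3058, 0.2181, 0.1441,
               0.0867, 0.0448, 0.0223, 0.0094, 0.0035, 0.0012, 0.0004, 0.0001,
               0.0000, 0.0000]
buick_rate = [0.1367, 0.1377, 0.1378, 0.1379, 0.1387, 0.1404, 0.1432, 0.1475,
              0.1537, 0.1654, 0.1716, 0.1824, 0.1932, 0.1999, 0.2050, 0.2194,
              0.2432, 0.2542]

weights = ex_post_weights(buick_ratio)
reweighted_rate = weighted_average(buick_rate, weights)
unweighted_rate = sum(buick_rate) / len(buick_rate)
print(round(unweighted_rate, 4), round(reweighted_rate, 4))
```

Under the equal-width assumption, the ten oldest ages share a single cell and so receive a small weight each, which pulls the average rate down relative to the simple mean.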

This example clearly illustrates the problems of aggregation 
bias typical of regressions estimated from economic 
aggregates without taking into account the real distribution of the 
units at the micro level. Thus, it is quite clear that the units 
at 17 and 18 years would not have the same regression 
weight as those at 1 year because the risk of loss at 1 year 
affects almost all vehicles put into circulation, while 
very few of them will be exposed to the risk of loss of value 
at more advanced ages. The result is that the unweighted 
estimate in this example overestimates the 
depreciation rate by about 15%. 

Figure 2 Weighted distribution of observations by price ratio, manufacturing plants. Ex post weighting 
Table 1 Relative prices of two models of cars based on the Kelly Blue Book and average depreciation rates before weighting 

                   Relative price                              Average depreciation rates 
                   Excluding disposals   Including disposals   Including disposals 
Year     Pr(t>S)*  Buick     Chrysler    Buick     Chrysler    Buick     Chrysler 
1        0.9988    0.8633    0.8257      0.8622    0.8246      0.1367    0.1743 
2        0.9901    0.7435    0.6801      0.7361    0.6734      0.1377    0.1753 
3        0.9666    0.6410    0.5608      0.6195    0.5420      0.1378    0.1754 
4        0.9220    0.5523    0.4621      0.5092    0.4261      0.1379    0.1755 
5        0.8526    0.4740    0.3794      0.4042    0.3234      0.1387    0.1762 
6        0.7582    0.4034    0.3087      0.3058    0.2341      0.1404    0.1779 
7        0.6433    0.3391    0.2482      0.2181    0.1597      0.1432    0.1805 
8        0.5164    0.2790    0.1953      0.1441    0.1009      0.1475    0.1846 
9        0.3892    0.2227    0.1491      0.0867    0.0580      0.1537    0.1906 
10       0.2731    0.1639    0.1050      0.0448    0.0287      0.1654    0.2018 
11       0.1770    0.1261    0.0772      0.0223    0.0137      0.1716    0.2077 
12       0.1051    0.0892    0.0523      0.0094    0.0055      0.1824    0.2180 
13       0.0567    0.0614    0.0344      0.0035    0.0019      0.1932    0.2284 
14       0.0276    0.0441    0.0236      0.0012    0.0007      0.1999    0.2347 
15       0.0120    0.0320    0.0164      0.0004    0.0002      0.2050    0.2396 
16       0.0046    0.0190    0.0093      0.0001    0.0000      0.2194    0.2534 
17       0.0016    0.0088    0.0041      0.0000    0.0000      0.2432    0.2761 
18       0.0005    0.0051    0.0023      0.0000    0.0000      0.2542    0.2867 
Average                                                        0.1727    0.2087 

* Survival probability based on estimates from the Micro-Economic Studies and Analysis Division of Statistics Canada. 
Figure 3 Distribution of cells used to estimate the average depreciation rate using data 

Table 2 Relative prices of two models of cars based on the Kelly Blue Book and the average depreciation rate after weighting 

          Relative prices        Average depreciation rates   Ex post weights 
          Including disposals    Including disposals 
Year      Buick     Chrysler     Buick     Chrysler           Buick     Chrysler 
1         0.8622    0.8246       0.1367    0.1743             2.5714    2.5714 
2         0.7361    0.6734       0.1377    0.1753             2.5714    2.5714 
3         0.6195    0.5420       0.1378    0.1754             1.2857    2.5714 
4         0.5092    0.4261       0.1379    0.1755             1.2857    2.5714 
5         0.4042    0.3234       0.1387    0.1762             2.5714    2.5714 
6         0.3058    0.2341       0.1404    0.1779             2.5714    1.2857 
7         0.2181    0.1597       0.1432    0.1805             1.2857    1.2857 
8         0.1441    0.1009       0.1475    0.1846             1.2857    0.2338 
9         0.0867    0.0580       0.1537    0.1906             0.2571    0.2338 
10        0.0448    0.0287       0.1654    0.2018             0.2571    0.2338 
11        0.0223    0.0137       0.1716    0.2077             0.2571    0.2338 
12        0.0094    0.0055       0.1824    0.2180             0.2571    0.2338 
13        0.0035    0.0019       0.1932    0.2284             0.2571    0.2338 
14        0.0012    0.0007       0.1999    0.2347             0.2571    0.2338 
15        0.0004    0.0002       0.2050    0.2396             0.2571    0.2338 
16        0.0001    0.0000       0.2194    0.2534             0.2571    0.2338 
17        0.0000    0.0000       0.2432    0.2761             0.2571    0.2338 
18        0.0000    0.0000       0.2542    0.2867             0.2571    0.2338 
Weighted 
average                          0.1479    0.1836 
Person-level and household-level regression 
estimation in household surveys 


David G. Steel and Robert G. Clark 


Abstract 


A common class of survey designs involves selecting all people within selected households. Generalized regression 
estimators can be calculated at either the person or household level. Implementing the estimator at the household level has 
the convenience of equal estimation weights for people within households. In this article the two approaches are compared 
theoretically and empirically for the case of simple random sampling of households and selection of all persons in each 
selected household. We find that the household level approach is theoretically more efficient in large samples and any 
empirical inefficiency in small samples is limited. 


Key Words: Contextual effects; Generalized regression estimator; Intra-class correlation; Sampling variance; Model-assisted; 
Household surveys. 


1. Introduction 


Many household surveys involve selecting a sample of 
households and then selecting all people in the scope of the 
survey in the selected households. Data on one or more 
variables of interest are collected for the people in the 
sample. There may be some auxiliary variables whose 
population totals and sample values are known; for example 
these may consist of population counts by geographic and 
demographic classifications. The generalized regression 
(GREG) estimator is often used to combine auxiliary 
information and sample data to efficiently estimate the 
population totals of the variables of interest. 

The GREG estimator makes use of a regression model 
relating the variable of interest to the auxiliary variables. 
The standard approach is to fit this model using data for 
each person in the sample (e.g., Lemaitre and Dufour 1987, 
first paragraph). This person-level GREG estimator is equal 
to a weighted sum of the sample values of the variable of 
interest, where the weights are in general different for each 
person. 

It is sometimes convenient to have equal weights for 
people within a household, for surveys which collect 
information on both household and person level variables of 
interest. The same weights can then be used for both types 
of variables. This ensures that relationships between 
household variables and person variables are reflected in 
estimates of total. If a household level variable is equal to 
the sum of person level variables (for example if household 
income is the sum of personal incomes), then the estimated 
total of the household variable will equal the estimated total 
of the person variable. This is not generally the case where 
separate weighting procedures are used for person and 
household variables. Similarly, if there is an inequality 
relationship between a household level variable and the sum 
of the person level variables, this will also be reflected in the 
estimates of the two variables. For example, the estimated 
number of households using child care centres should not 
exceed the estimated number of children using centres. 

The household-level GREG estimator achieves equal 
weights within households by fitting the regression model 
using household totals of the variable of interest and the 
auxiliary variables (e.g., Nieuwenbroek 1993). Weights with 
this property are called integrated weights. 

An alternative approach would be to use different 
estimation methods for household-level and person-level 
variables, and then make an adjustment to force agreement 
of estimates which should be equal. This approach is 
sometimes called benchmarking and has mainly been used 
to achieve consistency between estimates from annual and 
sub-annual business surveys (e.g., Cholette 1984). A 
benchmarking approach to household and person-level 
variables from household surveys would require explicit 
identification of which person and household-level variables 
should have equal population totals. In this article we 
concentrate on integrated weighting and do not consider 
benchmarking approaches. 

Luery (1986), Alexander (1987), Heldal (1992) and 
Lemaitre and Dufour (1987) discussed a number of methods 
which give integrated weights for person-level and 
household-level estimates. However, none of these authors 
evaluated the impact on the sampling variance of calculating 
the generalized regression estimator at the household level 
rather than the person level. This is an important issue in 
practice because the cosmetic benefit of integrated 
weighting must be balanced against any effect on sampling 
efficiency. 
This article compares the design variance, which is the 
variance over repeated probability sampling from a fixed 
population, of the person-level and household-level 
generalized regression estimators. In Section 2, we prove 
that the large sample variance of the household-level 
estimator is less than or equal to that of the person-level 
estimator, by showing that the former is optimal in a large 
class of GREG estimators. We show that this is because the 
household-level estimator effectively models contextual 
effects whereas the person-level estimator does not. In 
Section 3 the two estimators are compared for a range of 
variables in a simulation study. Section 4 is a discussion. 
Three theorems are proved in an Appendix. 


2. Theoretical comparison of person 
and household GREGs 


2.1 The generalized regression estimator 


In this subsection the generalized regression estimator is 
described for the general case of probability sampling from 
any population of units. Let $U$ be a finite population of 
units and $s \subset U$ be the sample. The probabilities of 
selection are $\pi_i = \Pr[i \in s]$ for units $i \in U$. Let $y_i$ be the 
variable of interest which is observed for units $i \in s$. Let 
$z_i$ be the vector of auxiliary variables for unit $i$, which are 
observed for every unit in the population. The population 
totals of these variables are $T_y$ and $T_z$ respectively. 

The generalized regression estimator of $T_y$ is based on a 
model relating the variable of interest to the auxiliary 
variables: 

$$
\begin{aligned}
E_M[y_i] &= \beta^T z_i \\
\mathrm{var}_M[y_i] &= v_i \sigma^2 \\
y_i, y_j &\ \text{independent for } i \neq j
\end{aligned}
\tag{1}
$$

where $v_i$ are known variance parameters. Subscripts "$M$" 
refer to expectations under a model and subscripts "$p$" refer 
to design-based expectations, which are expectations over 
repeated probability sampling from a fixed population. For 
business surveys collecting continuous variables such as 
business income and expenses, $v_i$ are often modelled as a 
function of business size. For household surveys, the 
variable of interest is often dichotomous, in which case $v_i$ is 
usually set to 1, corresponding to a homoskedastic model. 

Usually $z_i$ have the property that there exists a vector $\lambda$ 
such that $\lambda^T z_i = 1$ for all $i \in U$. For example, this is true if 
the regression model (1) contains an intercept parameter. 


Definition 1. generalized regression estimator 


The generalized regression estimator for model (1) is 
defined as 

$$\hat{T}_r = \hat{T}_\pi + \hat{B}^T (T_z - \hat{T}_{z\pi}) \tag{2}$$

where 

$$\hat{T}_\pi = \sum_{i \in s} \pi_i^{-1} y_i, \qquad \hat{T}_{z\pi} = \sum_{i \in s} \pi_i^{-1} z_i,$$

and $\hat{B}$ is a solution of 

$$\sum_{i \in s} c_i \pi_i^{-1} (y_i - \hat{B}^T z_i) z_i = 0$$

where $c_i$ are regression weights. (Often $c_i$ are set to 
$c_i = v_i^{-1}$.) 

The coefficients $\hat{B}$ are calculated from a weighted least 
squares regression of $y_i$ on $z_i$ for $i \in s$. The GREG 
estimator has low design variance if the model is 
approximately true but is design-consistent regardless of the 
truth of the model (e.g., Särndal, Swensson and Wretman 
1992, chapter 6). 
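As a minimal sketch of Definition 1, the following Python code computes the GREG estimator in its equivalent weighted form, using weights $\pi_i^{-1} g_i$ with $g_i = 1 + (T_z - \hat{T}_{z\pi})^T (\sum_{j \in s} c_j \pi_j^{-1} z_j z_j^T)^{-} c_i z_i$. The function name `greg_weights`, the simulated population and the simple random sample are all assumptions made for illustration; they are not taken from the article.

```python
import numpy as np


def greg_weights(z, pi, totals_z, c=None):
    """Return GREG weights w_i = g_i / pi_i for the sampled units.

    z        : (n, q) array of auxiliary values z_i for i in s
    pi       : (n,) array of inclusion probabilities pi_i
    totals_z : (q,) array of known population totals T_z
    c        : (n,) array of regression weights c_i (defaults to 1)
    """
    z = np.asarray(z, dtype=float)
    pi = np.asarray(pi, dtype=float)
    c = np.ones(len(pi)) if c is None else np.asarray(c, dtype=float)
    d = 1.0 / pi                                   # design weights 1 / pi_i
    ht_z = z.T @ d                                 # Horvitz-Thompson estimate of T_z
    cross = (z * (c * d)[:, None]).T @ z           # sum of c_i z_i z_i^T / pi_i
    adj = np.linalg.pinv(cross) @ (np.asarray(totals_z, dtype=float) - ht_z)
    return d * (1.0 + c * (z @ adj))


# Hypothetical population of N units with an intercept and one covariate.
rng = np.random.default_rng(0)
N, n = 1000, 100
x = rng.gamma(shape=2.0, scale=1.0, size=N)
z_pop = np.column_stack([np.ones(N), x])
y_pop = 2.0 + 3.0 * x + rng.normal(size=N)

s = rng.choice(N, size=n, replace=False)           # simple random sample
pi_s = np.full(n, n / N)
w = greg_weights(z_pop[s], pi_s, z_pop.sum(axis=0))

greg_estimate = w @ y_pop[s]                       # estimate of T_y
ht_estimate = (y_pop[s] / pi_s).sum()              # Horvitz-Thompson estimate
```

A useful check is that the weights reproduce the known auxiliary totals exactly, that is, `w @ z_pop[s]` equals `z_pop.sum(axis=0)` up to rounding error.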

For large samples the design variance of $\hat{T}_r$ is 
approximately equal to 

$$\mathrm{var}_p[\hat{T}_r] \approx \mathrm{var}_p[\tilde{T}_r] \tag{3}$$

where 

$$\tilde{T}_r = \hat{T}_\pi + B^T (T_z - \hat{T}_{z\pi})$$

and $B$ is a solution of 

$$\sum_{i \in U} c_i (y_i - B^T z_i) z_i = 0$$

(Särndal et al. 1992, Result 6.6.1, page 235). The coefficients 
$B$ are calculated from a weighted least squares 
regression of $y_i$ on $z_i$ for $i \in U$. The sample regression 
coefficients $\hat{B}$ are design-consistent for $B$. 


2.2 Person and household level GREGs 


We now consider the special case of household 
sampling, where the basic unit, $i$, is the person. Let $x_i$ be 
the $p$-vector of auxiliary variables observed for all people 
$i \in U$. The elements of $x_i$ may refer to characteristics of 
the person or of the household to which they belong. The 
population and sample of households will be denoted $U_1$ 
and $s_1$ respectively. The population of people in household 
$g \in U_1$ will be denoted $U_g$, which is of size $N_g$. Let 
$y_{g1} = \sum_{i \in U_g} y_i$ and $x_{g1} = \sum_{i \in U_g} x_i$ be the household totals 
of $y_i$ and $x_i$. Let $\bar{x}_g = x_{g1} / N_g$ be the household mean of 
$x_i$. 

We consider the common case where households are 
selected by probability sampling and all people are selected 
from selected households, so that $s = \bigcup_{g \in s_1} U_g$. Let 
$\pi_{g1} = P[g \in s_1] > 0$ be the probability of selection for 
household $g$. It follows that $\pi_i = \pi_{g1}$ for $i \in U_g$. 

The person-level GREG, $\hat{T}_P$, is the GREG under the 
following model: 

$$
\begin{aligned}
E_M[y_i] &= \beta^T x_i \\
\mathrm{var}_M[y_i] &= v_i \sigma^2 \\
y_i, y_j &\ \text{independent for } i \neq j.
\end{aligned}
\tag{4}
$$

So the person-level GREG, $\hat{T}_P$, is given by substituting $x_i$ 
for $z_i$ in (2). Model (4) ignores any correlations between 
$y_i$ and $y_j$ for people $i$ and $j$ in the same household. 
These correlations were 0.3 or less in most of the variables 
considered by Clark and Steel (2002), although higher 
values occurred for variables related to ethnicity, such as 
Indigenous self-identification. Correlations of 1 could occur 
for environmental variables. Tam (1995) shows that the 
optimal model-assisted estimator for cluster sampling is 
robust to mis-specification of within-cluster correlations. 
One way of interpreting this result is that correlations within 
households are not relevant to estimating population totals, 
because all people are selected in selected households. So 
within-household correlations do not help to estimate for 
non-sample individuals, since the sampled and non-sampled 
people are in distinct households. 

A number of methods have been suggested for GREG-type 
estimation with equal weights within households. 
Nieuwenbroek (1993) motivated an estimator by 
aggregating model (4) to household level: 

$$
\begin{aligned}
E_M[y_{g1}] &= \beta^T x_{g1} \\
\mathrm{var}_M[y_{g1}] &= v_{g1} \sigma^2 \\
y_{g1}, y_{k1} &\ \text{independent for } g \neq k
\end{aligned}
\tag{5}
$$

where $v_{g1} = \sum_{i \in U_g} v_i$. The GREG estimator using sample 
data $\{y_{g1}, x_{g1} : g \in s_1\}$ based on this model is $\hat{T}_H$: 

$$\hat{T}_H = \hat{T}_\pi + \hat{B}_H^T (T_x - \hat{T}_{x\pi}) \tag{6}$$

where $\hat{B}_H$ is a solution of 

$$\sum_{g \in s_1} a_g \pi_{g1}^{-1} (y_{g1} - \hat{B}_H^T x_{g1}) x_{g1} = 0. \tag{7}$$

The regression coefficient $\hat{B}_H$ is a household level 
weighted least squares regression of the sample values of 
$y_{g1}$ on $x_{g1}$ using weights $a_g \pi_{g1}^{-1}$. The values of $a_g$ could 
be set to $v_{g1}^{-1}$. If $v_i = 1$ then $v_{g1} = N_g$ so $a_g = N_g^{-1}$. 
Alternatively, $a_g = 1$ could also be used. 

Several other equivalent integrated weighting methods 
have been used. Lemaitre and Dufour (1987) constructed a 
generalized regression estimator at person level, using $\bar{x}_g$ 
instead of $x_i$ as the auxiliary variables. Nieuwenbroek 
(1993) commented that this is equivalent to (6) if 
$c_i = a_g N_g$ for $i \in U_g$. Alexander (1987) developed 
closely related weighting methods using a minimum 
distance criterion. 

Both the person and household level GREG can be 
written in weighted form $\sum_{i \in s} w_i y_i$. The weights for both 
estimators can be written as $w_i = \pi_i^{-1} g_i$ where 

$$g_i = 1 + (T_x - \hat{T}_{x\pi})^T \left( \sum_{j \in s} c_j \pi_j^{-1} x_j x_j^T \right)^{-} c_i x_i$$

for $\hat{T}_P$ and 

$$g_i = 1 + (T_x - \hat{T}_{x\pi})^T \left( \sum_{g \in s_1} a_g \pi_{g1}^{-1} x_{g1} x_{g1}^T \right)^{-} a_g x_{g1}$$

for $\hat{T}_H$, where person $i$ belongs to household $g$. 
(Superscript "$-$" stands for generalized inverse of a matrix.) 
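The two sets of weights can be compared numerically. The sketch below builds a small hypothetical population of households, with $x_i = (1, \text{child}_i)$ chosen only for illustration, draws a simple random sample of households, and computes the person-level weights (with $c_i = 1$) and the household-level weights (with $a_g = 1$). All names and the simulated data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: M households of 1 to 5 people, x_i = (1, child_i).
M, m = 400, 60
sizes = rng.integers(1, 6, size=M)
hh = np.repeat(np.arange(M), sizes)                 # household index of each person
child = (rng.random(hh.size) < 0.3).astype(float)
x = np.column_stack([np.ones(hh.size), child])
T_x = x.sum(axis=0)

# Household totals x_g1 = (N_g, number of children in household g).
x_g1 = np.column_stack([sizes.astype(float),
                        np.bincount(hh, weights=child, minlength=M)])

# Simple random sample of households; every member of a selected household is observed.
s1 = rng.choice(M, size=m, replace=False)
in_s = np.isin(hh, s1)
pi = m / M                                          # pi_i = pi_g1 for all i in U_g


def g_factors(rows, scale, totals, pi):
    """g = 1 + (T_x - T_x_hat)^T (sum scale * row row^T / pi)^- (scale * row)."""
    t_hat = rows.sum(axis=0) / pi
    cross = (rows * scale[:, None]).T @ rows / pi
    adj = np.linalg.pinv(cross) @ (totals - t_hat)
    return 1.0 + scale * (rows @ adj)


# Person-level weights (c_i = 1): one regression row per sampled person.
x_s = x[in_s]
w_person = g_factors(x_s, np.ones(len(x_s)), T_x, pi) / pi

# Household-level weights (a_g = 1): one row per sampled household,
# then the household weight is copied to each of its members.
g_household = np.full(M, np.nan)
g_household[s1] = g_factors(x_g1[s1], np.ones(m), T_x, pi)
w_household = g_household[hh[in_s]] / pi
```

Both weight vectors reproduce the population totals `T_x`, but only `w_household` is constant within each household, which is the integrated-weights property discussed in the Introduction.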


2.3. Theoretical results 


In this section, we show that $\hat{T}_H$ has the lowest possible 
large sample variance in a class of estimators which also 
includes $\hat{T}_P$, for the sample design where households are 
selected by simple random sampling without replacement. 
We will then explain this result by showing that $\hat{T}_H$ is 
equivalent to a regression estimator calculated using person 
level data, where the model includes contextual effects. 

For large samples, $\hat{T}_P$ and $\hat{T}_H$ can be approximated by 

$$\tilde{T}_P = \hat{T}_\pi + B_P^T (T_x - \hat{T}_{x\pi})$$

and 

$$\tilde{T}_H = \hat{T}_\pi + B_H^T (T_x - \hat{T}_{x\pi})$$

respectively, where $B_P$ and $B_H$ are solutions of 

$$
\begin{aligned}
\sum_{i \in U} c_i (y_i - B_P^T x_i) x_i &= 0 \\
\sum_{g \in U_1} a_g (y_{g1} - B_H^T x_{g1}) x_{g1} &= 0
\end{aligned}
\tag{8}
$$

(Särndal et al. 1992, Result 6.6.1, page 235). Theorem 1 
states the minimum variance estimator in a class including 
$\tilde{T}_P$ and $\tilde{T}_H$. 
Theorem 1. Optimal estimator for simple cluster 
sampling 

Suppose that $m$ households are selected by simple random 
sampling without replacement from a population of $M$ 
households, and all people are selected from selected 
households. Consider the estimator of $T_y$ given by 

$$\tilde{T} = \hat{T}_\pi + h^T (T_x - \hat{T}_{x\pi})$$

where $h$ is a constant $p$-vector. It is assumed that there 
exists a vector $\lambda$ such that $\lambda^T x_i = 1$ for all $i \in U$. The 
variance of this estimator is minimised by $h^*$ which are 
solutions of 

$$\sum_{g \in U_1} (y_{g1} - h^{*T} x_{g1}) x_{g1} = 0.$$

Hence $\tilde{T}_H$ with $a_g = 1$ for all $g$ is the optimal choice of 
$\tilde{T}$. 


Theorem 1 has the perhaps surprising implication that 
$\hat{T}_H$ (with $a_g = 1$ for all $g$) has lower variance than $\hat{T}_P$ for 
large samples. This is in spite of the fact that $\hat{T}_H$ discards 
some of the information in the sample, because it uses the 
household sums of $x_i$ and $y_i$. The Theorem suggests that 
$\hat{T}_H$ is the appropriate GREG estimator for the cluster 
sampling design assumed here, and that the information 
discarded by summing to household level is not relevant 
when this design is used. To explain why $\hat{T}_H$ can perform 
better than $\hat{T}_P$, we will make use of a "linear contextual 
model" which is a more general model for $E_M[y_i]$ than (4). 
The model is: 

$$
\begin{aligned}
E_M[y_i] &= \gamma_1^T \bar{x}_g + \gamma_2^T x_i \quad (i \in U_g) \\
\mathrm{var}_M[y_i] &= v_i \sigma^2 \\
y_i, y_j &\ \text{independent for } i \neq j.
\end{aligned}
\tag{9}
$$


Both $\bar{x}_g$ and $x_i$ are used as explanatory variables for $y_i$ 
because the household mean of the person level auxiliary 
variables may capture some of the effect of household 
context (Lazarfeld and Menzel 1961). For example, if the 
elements of $x_i$ are indicator variables summarising the age 
and sex of person $i$ then $\bar{x}_g$ are the proportions of people 
in the household falling into different age and sex 
categories. If the population of interest includes both adults 
and children, then $\bar{x}_g$ includes the proportion of children in 
the household, which could be relevant to the labour force 
participation of adults in the household. 

Theorem 2 shows that the improvement in the variance 
from using $\hat{T}_H$ with $a_g = 1$ rather than using $\hat{T}_P$ can be 
explained by the linear contextual model. 


Theorem 2. Explaining the difference in the 
asymptotic variances 

Suppose that households are selected by simple random 
sampling without replacement and all people are selected 
from selected households. Let $r_i = y_i - B_P^T x_i$, and let $B_C$ 
be the result of regressing $r_i$ on $\bar{x}_g$ over $i \in U$ using 
weighted least squares regression weighted by $N_g$. Then 

$$\mathrm{var}_p[\tilde{T}_P] - \mathrm{var}_p[\tilde{T}_H] = \frac{M^2}{m} \left( 1 - \frac{m}{M} \right) (M - 1)^{-1} B_C^T \left( \sum_{g \in U_1} x_{g1} x_{g1}^T \right) B_C$$

where $\tilde{T}_H$ is calculated using $a_g = 1$ for all $g$. 


The result shows that the reduction in variance from 
using $\tilde{T}_H$ (with $a_g = 1$) rather than $\tilde{T}_P$ is a quadratic form 
in $B_C$. Hence the extent of the improvement depends on the 
extent to which $\bar{x}_g$ helps to predict $y_i$ after $x_i$ has already 
been controlled for, i.e., the extent to which a linear 
contextual effect helps to predict $r_i$ over $i \in U$ using a 
weighted least squares regression weighted by $N_g$. 

The proofs of Theorems 1 and 2 are very much 
dependent on the assumption of cluster sampling. The 
results would not be expected to apply if there was 
subsampling within households. 

Theorems 1 and 2 only apply with $a_g = 1$ in the 
weighted least squares regression for $\hat{T}_H$. Other choices of 
$a_g$ are often used, for example it would often be reasonable 
to assume that $v_{g1} = N_g$ in model (5), in which case it 
would be sensible to use $a_g = N_g^{-1}$. Theorem 3 shows that 
$\hat{T}_H$ is equivalent to a person-level GREG estimator fitted 
under the linear contextual model for other choices of $a_g$. 


Theorem 3. The linear contextual GREG 

For sample designs where all people are selected from 
selected households and $\pi_{g1} > 0$ for all $g \in U_1$, $\hat{T}_H$ with a 
given choice of $a_g$ is the generalized regression estimator 
for model (9) where $c_i = a_g N_g$ for $i \in U_g$. 


Theorem 3 means that $\hat{T}_H$ is the GREG under a more 
general model than $\hat{T}_P$. Nieuwenbroek (1993) showed that 
$\hat{T}_H$ is equal to a person-level GREG derived from 
regressing $y_i$ on $\bar{x}_g$. Theorem 3 states it is also equal to the 
person-level GREG from regressing $y_i$ on both $x_i$ and 
$\bar{x}_g$, thereby automatically incorporating any household 
contextual effects. As a result, $\hat{T}_H$ would be expected to 
have lower variance than $\hat{T}_P$ for large samples. (In the case 
of $a_g = 1$, Theorem 1 stated that this is always the case.) 
For small samples, however, a more general model may be 
counter-productive. Silva and Skinner (1997) showed for 
single-stage sampling that adding parameters to the model 
can increase the variance of the GREG estimator, although 
this effect is negligible for large samples. It is possible that 
the contextual effects have little or no predictive power for 
some variables. In this case, it would be expected that $\hat{T}_H$ 
would perform slightly worse than $\hat{T}_P$ for small samples, 
and about the same for large samples. 
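The equivalence stated in Theorem 3 can be checked numerically on simulated data. The sketch below is an illustration under assumed inputs (a hypothetical population with $x_i = (1, \text{child}_i)$ and the choice $a_g = 1$); it compares the household-level GREG weights with the person-level GREG weights fitted under the contextual model with $c_i = a_g N_g$. Because $x_i$ and $\bar{x}_g$ both contain a constant in this example, only one copy of the constant is kept in the regressors, which leaves the fitted space unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population of M households; x_i = (1, child_i).
M, m = 300, 50
sizes = rng.integers(1, 6, size=M)
hh = np.repeat(np.arange(M), sizes)                 # household index of each person
child = (rng.random(hh.size) < 0.3).astype(float)
N_g = sizes.astype(float)
kids_g = np.bincount(hh, weights=child, minlength=M)

x_g1 = np.column_stack([N_g, kids_g])               # household totals of x_i
mean_child = (kids_g / N_g)[hh]                     # household mean attached to each person

# Regressors of the contextual model: household mean and person value of x_i,
# with the duplicated constant dropped.
z = np.column_stack([np.ones(hh.size), mean_child, child])

s1 = rng.choice(M, size=m, replace=False)           # simple random sample of households
in_s = np.isin(hh, s1)
pi = m / M
a_g = np.ones(M)                                    # illustrative choice a_g = 1


def g_factors(rows, scale, totals, pi):
    """g = 1 + (T - T_hat)^T (sum scale * row row^T / pi)^- (scale * row)."""
    t_hat = rows.sum(axis=0) / pi
    cross = (rows * scale[:, None]).T @ rows / pi
    adj = np.linalg.pinv(cross) @ (totals - t_hat)
    return 1.0 + scale * (rows @ adj)


# Household-level GREG: regression on household totals with weights a_g.
g_hh = np.full(M, np.nan)
g_hh[s1] = g_factors(x_g1[s1], a_g[s1], x_g1.sum(axis=0), pi)
w_household = g_hh[hh[in_s]] / pi

# Person-level GREG under the contextual model with c_i = a_g * N_g.
c = (a_g * N_g)[hh[in_s]]
w_contextual = g_factors(z[in_s], c, z.sum(axis=0), pi) / pi

max_gap = np.max(np.abs(w_household - w_contextual))
```

In this sketch `max_gap` should be zero up to floating point error, so any estimate formed with one set of weights coincides with the estimate formed with the other.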

The contextual model, (9), includes all of the elements of 
$\bar{x}_g$ and all of the elements of $x_i$. An alternative would be 
to use only those elements of either $x_i$ and $\bar{x}_g$ which are 
significant, or which give improvements in the estimated 
variance of a GREG estimator. A GREG estimator based on 
this type of model would probably have lower variance than 
the estimators considered in this paper, but would not give 
integrated weights unless the same elements of $x_i$ and $\bar{x}_g$ 
were used. 


3. Empirical study 


3.1 Methodology 


A simulation study was undertaken to compare the 
person and household GREGs, $\hat{T}_P$ and $\hat{T}_H$, for a range of 
survey variables. We used two populations, consisting of 
187,178 households randomly selected from the 2001 
Australian Population Census and 210,132 households from 
the 1995 Australian National Health Survey. All adults and 
children in the households were included. The average 
household size was approximately 2.5. 

We selected cluster samples from these populations, 
where households were selected by simple random 
sampling without replacement and all people from selected 
households were selected. We simulated samples of size 
$m$ = 500, 1,000, 2,000, 5,000 and 10,000 households. In 
each case, 5,000 samples were selected. The auxiliary 
variables $x_i$ consisted of indicator variables of sex by 
age group (12 categories). (This choice of $x_i$ means that the 
GREG estimation is equivalent to post-stratification.) The 
person-level GREG with $c_i = 1$ ($\hat{T}_P$), the household-level 
GREG with $a_g = N_g^{-1}$ ($\hat{T}_{H1}$), and the household-level 
GREG with $a_g = 1$ ($\hat{T}_{H2}$) were all calculated. We also 
included the Hájek estimator 

$$\hat{T}_{\text{Hájek}} = N \, \frac{\sum_{i \in s} \pi_i^{-1} y_i}{\sum_{i \in s} \pi_i^{-1}}$$

which equals $(N/n) \sum_{g \in s_1} \sum_{i \in U_g} y_i$ for cluster sampling with 
simple random sampling of households, where $n$ is the 
realized sample size of people. 

The variables include labour force, health and other 
topics. All of the variables are dichotomous except for 
income (annual income in Australian dollars, based on 
range data reported from the Census). "Employment(F)" is 
the indicator variable which is 1 if a person is employed and 
female, and 0 otherwise. The first six variables are from the 
Census population and the remaining five variables are from 
the health population. 
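The Hájek estimator used as a benchmark in this subsection is the ratio $N \sum_{i \in s} \pi_i^{-1} y_i / \sum_{i \in s} \pi_i^{-1}$. As a small sketch, with a hypothetical function name and made-up sample values, the code below shows that it reduces to $(N/n) \sum_{i \in s} y_i$ when every sampled person has the same selection probability, as is the case under simple random sampling of households.

```python
import numpy as np


def hajek_total(y, pi, population_size):
    """Hajek estimator: N * sum(y_i / pi_i) / sum(1 / pi_i) over the sample."""
    y = np.asarray(y, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)
    return population_size * (d @ y) / d.sum()


# Hypothetical sample: three households of sizes 2, 1 and 3, equal pi for everyone.
y_sample = np.array([1, 0, 1, 1, 0, 1], dtype=float)
pi_sample = np.full(y_sample.size, 0.05)
N = 500                                            # assumed population count of people

estimate = hajek_total(y_sample, pi_sample, N)
shortcut = N / y_sample.size * y_sample.sum()      # (N / n) * sum of y over the sample
```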


3.2 Results 


Table 1 shows the relative root mean squared errors 
(RRMSEs) of the Hájek estimator, the person-level GREG $\hat{T}_P$, 
and the household-level GREGs $\hat{T}_{H1}$ (with $a_g = N_g^{-1}$) and 
$\hat{T}_{H2}$ (with $a_g = 1$) for a sample size of 
1,000 households. The RRMSEs are expressed as a 
percentage of the true population total. The biases have not 
been tabulated because they were a negligible component of 
the MSE in all cases. The percentage improvements in MSE 
of $\hat{T}_{H1}$ and $\hat{T}_{H2}$ relative to $\hat{T}_P$ are also shown. The figures 
in brackets are the simulation standard errors of these 
percentage improvements. 
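The summary measures reported in the tables can be sketched as follows. The article does not give the formula for the percentage improvement in MSE, so the definition used here, $100 (1 - \mathrm{MSE} / \mathrm{MSE}_{\text{baseline}})$, is an assumption, as are the function names and the made-up estimates.

```python
import numpy as np


def rrmse_percent(estimates, true_total):
    """Root mean squared error over simulated samples, as a percentage of the true total."""
    err = np.asarray(estimates, dtype=float) - true_total
    return 100.0 * np.sqrt(np.mean(err ** 2)) / true_total


def mse_improvement_percent(estimates, baseline_estimates, true_total):
    """Percentage reduction in MSE relative to a baseline estimator (assumed definition)."""
    mse = np.mean((np.asarray(estimates, dtype=float) - true_total) ** 2)
    mse_base = np.mean((np.asarray(baseline_estimates, dtype=float) - true_total) ** 2)
    return 100.0 * (1.0 - mse / mse_base)


# Made-up estimates from four simulated samples; the true total is 100.
person_level = [90.0, 110.0, 95.0, 105.0]
household_level = [92.0, 108.0, 96.0, 104.0]
rrmse_p = rrmse_percent(person_level, 100.0)
gain = mse_improvement_percent(household_level, person_level, 100.0)
```

Under this definition a negative value means that the household-level estimator had a larger MSE than the person-level estimator.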

For this sample size, $\hat{T}_{H1}$ and $\hat{T}_{H2}$ performed slightly 
worse than $\hat{T}_P$ for the health variables and slightly better for 
most other variables. The greatest gain was in estimating the 
number of sole parents; this variance was reduced by 10.8% 
and 16.3% by using the household-level GREGs. For all 
other variables, either the improvement was small or the 
household GREG was slightly worse than the person-level 
GREG. The inefficiency from using a household-level 
GREG rather than $\hat{T}_P$ was never more than 2.2%. 

Table 2 shows the percentage improvement in MSE from 
using $\hat{T}_{H1}$ rather than $\hat{T}_P$ for different sample sizes. The 
simulation standard errors for each figure are shown in 
brackets. Table 3 shows the percentage improvements from 
using $\hat{T}_{H2}$ rather than $\hat{T}_P$. The asymptotic percentage 
improvements ($m = \infty$) are also shown, based on the large 
sample approximation to the variance of a GREG. For both 
household-level GREGs, the percentage improvements are 
generally increasing as the sample size increases. For 
$m = 500$, the household GREGs are generally worse than 
the person GREGs, although never more than 5% worse. 
For $m = 10,000$, an improvement is recorded for over half 
of the variables. The greatest improvements were for 
estimates of the number of sole parents (11.5%) and 
employed women (4.2%); all other improvements were 
small. $\hat{T}_{H1}$ and $\hat{T}_{H2}$ never had variances more than 0.2% 
higher than $\hat{T}_P$ for $m = 10,000$. Generally $\hat{T}_{H2}$ performs 
better than $\hat{T}_{H1}$ for larger sample sizes, as would be 
expected from Theorem 1, but the reverse is true for small 
sample sizes. 

In practice, estimates of subpopulation totals are often of 
as much interest as population totals. Table 4 shows the 
performance of the various estimators for age-sex domains 
(12 age categories) and region domains, for the sample size 
of 1,000 households. There were 49 regions in the census 
dataset. The health dataset did not contain a similar region 
variable; instead, the socioeconomic quintile of the collection 
district (a geographical unit consisting of approximately 200 
contiguous households) was used as the domain. The 
domain estimators were produced by calculating weights 
from each estimator and taking the weighted sum over the 
sample in the domain. This is equivalent to the domain ratio 
estimator described in Case 1, Section 2.1 of Hidiroglou and 
Patak (2004). We have used this method because it is the 
most commonly used in practice, as it enables all domains 
and population totals to be estimated with a single set of 
weights, although more efficient domain estimators exist 
(Hidiroglou and Patak 2004, cases 2-6). 

In each case, the median RRMSE over the domains is 
shown. The table shows that there is not much difference 
between the three GREG estimators. For age-sex domains, 
the household GREGs did slightly better than the person 
GREG for census variables and slightly worse for health 
variables. For region estimates, the household GREGs were 
slightly worse in all cases. Table 5 shows that the 
household GREGs performed very similarly to $\hat{T}_P$ for a 
sample size of 10,000 households. It is worth noting that 
Theorems 1 and 2 do not apply to the domain estimators we 
have used. 


Table 1 Relative RMSEs for sample size of 1,000 households 

Variable         RRMSE%                                                  % improvement in MSE 
                 Hájek     $\hat{T}_P$   $\hat{T}_{H1}$   $\hat{T}_{H2}$      $\hat{T}_{H1}$     $\hat{T}_{H2}$ 
employed         2.62      2.09      2.09         2.10           0.20 (0.26)    -0.28 (0.27) 
employed F       3.78      3.05      3.01         3.02           2.63 (0.33)    2.09 (0.33) 
income           2.56      2.20      2.19         2.19           1.04 (0.25)    0.75 (0.24) 
low income       5.04      4.87      4.89         4.90           -0.62 (0.20)   -1.12 (0.22) 
hrs worked       3.08      2.54      2.53         2.53           0.94 (0.28)    0.70 (0.28) 
sole parent      12.50     MIGEIZ    omeme2.02.   Spates         10.84 (0.62)   16.31 (0.49) 
arthritis        Sse       4.50      4.53         4.53           -1.38 (0.17)   -1.57 (0.18) 
smoker           4.73      4.57      4.60         4.61           -1.64 (0.18)   -1.81 (0.20) 
high BPR         6.80      5.30      S230         5.36           -1.70 (0.17)   -2.06 (0.18) 
fair/poor hlth   9.79      9.42      9.47         9.47           -1.16 (0.16)   -1.07 (0.18) 
alcohol          4.81      4.66      4.70         4.71           -1.77 (0.16)   -2.15 (0.18) 

Table 2 Improvement in MSE of household GREG $\hat{T}_{H1}$ ($a_g = N_g^{-1}$) compared to $\hat{T}_P$ 

Variable         % improvement in MSE 
                 m = 500        1,000          2,000          5,000          10,000         $\infty$ 
employed         -0.65 (0.31)   0.20 (0.26)    1.02 (0.24)    0.90 (0.21)    2.17 (0.21)    1.85 
employed F       1.22 (0.37)    2.63 (0.33)    2.59 (0.33)    3.53 (0.31)    4.24 (0.31)    4.13 
income           -1.53 (0.31)   1.04 (0.25)    0.48 (0.24)    0.61 (0.19)    1.43 (0.19)    1.07 
low income       -2.45 (0.27)   -0.62 (0.20)   0.02 (0.18)    0.18 (0.15)    0.00 (0.00)    0.65 
hrs worked       -0.26 (0.34)   0.94 (0.28)    1.72 (0.27)    1.61 (0.24)    2.64 (0.24)    2.12 
sole parent      7.81 (0.69)    10.84 (0.62)   10.74 (0.61)   10.23 (0.57)   11.50 (0.58)   11.21 
arthritis        -3.01 (0.24)   -1.38 (0.17)   -0.34 (0.12)   -0.08 (0.09)   -0.13 (0.07)   0.08 
smoker           -3.91 (0.25)   -1.64 (0.18)   -1.02 (0.12)   -0.26 (0.08)   -0.06 (0.07)   0.16 
high BPR         -2.93 (0.24)   -1.70 (0.17)   -0.86 (0.12)   -0.31 (0.08)   -0.04 (0.06)   0.08 
fair/poor hlth   -3.67 (0.25)   -1.16 (0.16)   -0.71 (0.12)   -0.05 (0.08)   0.03 (0.06)    0.10 
alcohol          -4.22 (0.23)   -1.77 (0.16)   -0.77 (0.12)   -0.31 (0.08)   -0.21 (0.07)   0.14 


Table 3 Improvement in MSE of household GREG $\hat{T}_{H2}$ ($a_g = 1$) compared to $\hat{T}_P$ 


Variable % improvement in MSE 
m = 500 1,000 2,000 5,000 10,000 $\infty$ 

employed =i85 (O:55)* 0284027)" 1.25 (025)- MOS OZ) 2.22(0.21) @ 19s 
employed F 0.28 (0:39)11/2:091(0.33)9 2.71°(0.33)" °3:55 (0:29) 4.50) O0)r "4:31 
income -2.64 (0.31) 0.75 (0.24) 0.71(0.22) 0.90(0.17) 1.30(0.16) 1.37 
low income -3.15 (0.30) -1.12 (0.22) -0.15 (0.18) 0.06(0.15) 0.00(0.00) 0.94 
hrs worked = OE (O35) 0: 70(0.25) 46 LOS(025) LIS (OZ 2:57 (022 226 
sole parent 14.70 (0.53) 16.31 (0.49) 16.39 (0.47) 15.41 (0.44) 16.44 (0.44) 16.35 
arthritis -3.31 (0.26) -1.57(0.18) -0.05 (0.13) -0.12 (0.09) -0.10(0.07) 0.16 
smoker -3.82 (0.28) -1.81 (0.20) -0.69(0.14) 0.21(0.11) 0.28(0.10) 0.57 
high BPR -3.20 (0.26) -2.06 (0.18) -1.12 (0.13) -0.40(0.09) -0.05 (0.07) 0.12 
fair/poor hlth  -4.02 (0.28) -1.07(0.18) -0.57(0.13) -0.09(0.09) 0.00(0.07) 0.15 
alcohol -5.00 (0.26) -2.15 (0.18) -0.82 (0.13) -0.49 (0.09) -0.29 (0.08) 0.18 
Table 4 Median relative RMSEs for domain estimators for sample size m = 1,000 

Variable         Age-Sex Domains                                   Region Domains 
                 Hájek  $\hat{T}_P$  $\hat{T}_{H1}$  $\hat{T}_{H2}$      Hájek  $\hat{T}_P$  $\hat{T}_{H1}$  $\hat{T}_{H2}$ 
employed 12°74 7.92, 7.93: B9OF29:89029792-43020 30:34 
employed F 13.12 8.32 8.36 8.34 34.64 34.65 35.03 35.16 
income 13.25 843 8.49 8.47 28.04 28.12 28.43 28.51 
low income 21.17 18.77 18.96 18.94 42.71 42.85 43.24 43.33 
hrs worked F456: “10:69 = 10.76 7 10:72 31.24 3712331252. 31.63 
sole parent 96.20 96.33 97.64 96.69 92.99. 93.30 94.37 93.50 
arthritis 24.94 20.94 21.12 21.11 13.31 12.94 13.02 13.04 
smoker S2AONZ9. 25-029. 39 22997 232. 122 3512538 
high BPR 27.01 23.80 23.97 23.95 15.83 15.31 15.44 15.45 
fair/poor hith 39.64 37.73 38.05 38.08 22.38 22.30 22.51 22.55 
alcohol 25539 2he4282) (53 21.58 12.73) 12.70. 12.80 12:82 


Table 5 Median relative RMSEs for domain estimators for sample size m = 10,000 

Variable         Age-Sex Domains                                   Region Domains 
                 Hájek  $\hat{T}_P$  $\hat{T}_{H1}$  $\hat{T}_{H2}$      Hájek  $\hat{T}_P$  $\hat{T}_{H1}$  $\hat{T}_{H2}$ 
employed BIT. 235 232) WES aS Some SESSIe Eto) ae Ot8G 
employed F SeS 6m 4seee 43) 2-42) 030m 1026F 1025 81025 
income 39 Te S82 ole S24 ene OM Oe San OO, 
low income 6:31) 25:63) 5.62 S16) 12°67, 12.68) ex69 12.69 
hrs worked AVI. 3 Ney Sells SP OG Ses eA O27) 
sole parent 28:40) 28,26) 28.29 2823) 27 27427 lon 27 etl 
arthritis TeA0L G26) 6:27 162 OS. ODES OO aE o> 
smoker SEES | tesaNsy  toeaker tenes © ey CoMe) Sey | Bioksy SoM 
high BPR 8.07 7.02 7.01 7.01 466 4.48 4.49 4.49 
faitipoor hith=7 11.69" 11302 11.02 11.01 6.75 6.69 6.69 6.69 
alcohol TAS 16:43, 6.43) C43 eSt87T 3.850 S85 woo) 

4. Discussion Acknowledgements 


The standard person-level GREG estimator produces 
unequal weights within households. Household-level GREG 
estimators can be used to give integrated household and 
person weights, which is beneficial for surveys collecting 
information on both household-level and person-level 
variables. This article demonstrated that there is little or no 
loss associated with the practical benefit of integrated 
weighting arising from using a household-level GREG 
estimator. For large samples, the household-level GREG has 
lower design variance than the person-level GREG. For 
smaller samples there is at most a small increase in variance 
for some variables from using the household GREG, 
because this estimator is equivalent to using a regression 
model containing more parameters. Therefore, if integrated 
weights would improve the coherence of a household 
survey’s outputs, the household-level GREG can be adopted 
with little or no detriment to the variance and bias of 
estimators. 
Appendix 


Proof of theorems 
Proof of theorem 1 
Let ¥,=T7,/M and X, =T7,/M be the population 


means of y,, and x,, respectively. The variance of T is 


var, [T] = var[T, + h’ (Ty — Ty_)] 
= va ds Oat h’x.,) 
§€5, 
_M?*(,;_m)¢° 
om ( a 5, 
where S? = (M -1)"Ygeu Wg —A7X_ — (% — AT X)Y. 
To minimise with respect to h, we set the derivative of S’ 
to zero: 


0=(M-))'> 


gel, 
a im hi x, -; (¥, > h’ X,)} (Xp, a X,) 


0= § Ogewixg = a x) x, 


gel, 


-> {Op1 ie Y) é h" (x, _ X,)}X, 


geU, 


O= DY in — Wx -  - WX) x,, 


gel, 
0 = DY Wg — AX) X— — K — ATX,) Ty. (10) 


geU, 


We now show that (10) is satisfied by h*. By 
assumption, h* satisfies 


=o Ve 


ge, 


SM or: (11) 


Hence the first sum in the right hand side of (10) is equal to 
zero for h = h*. Premultiplying both sides of (11) by 47 
gives 


0= oe (Ve 


- lg) Nee 


gel, 

0= ye V1 itd ois) 
ge, 

0 =7,-TI hr. 


Dividing by M gives ¥, — XJ h* = 0. Hence the rest of 
the right hand side of (10) is equal to zero. So h’ satisfies 
(10). 

Proof of theorem 2 


Let "$-$" denote a generalized inverse of a matrix. Then 
B- is equal to 


ae {5 y N,e88} © d Nee" 


geU, ic, gel, icU, 


= > a >» X gil ei: (12) 


gel, geU, 


an Ti = T 
Now, 7, = y; — Bpx; SO, = Vg, — BpX,,. Hence (12) 
becomes 
T T 
By = = XeiX gi a Xi(Ve1 — BpX,1) 
ge, geU, 
oi 
= > Xo1% g1 D2. X oi) g1 
ge, gevU, 
T T 
Ds XpiXgi ye Xp1X Bp 
gel, geU, 
Oct (13) 
: - T 
since B, = Wee KerX or} Lge, Lieu, X gi Voi: The 


difference in the variances is given by 


var, [T.] - var, [F}= M- (1 - mn) (M -1)'! 


| ye, (Ve 4 Bist). ¢ ye 5s o Bi, 


gel, gel, 


which becomes 


{var, [T,] - var, (Fas / ae (1 = mm) (M - "| 


a 2 if T D 
ae Pes 2 (7,1 + BpX,, - ByX,1) 
§eU, §eVy, 
2 T 2 
=o Dy hee Ds Vet pyPoxa) 
geU, ge, 
T T 2 T 2 
= ~~ (7,1 - Boxy + BoX,) — Ss (r,1 - BcX,1) 
gel, gel, 
T 2 T 2 
Fe » (r,1 ra BCX.) te ys (BoX,1) 
geU, geu, 
T T pT 
+2 2, Cee ee 
gEU, 
T 2 
ay. De (7, — BeX,1) 
ge, 
T T T T 
a oz BeX i XiBe + ZS (r — Bog) XpBe- (14) 
gel, geU, 


Now, B, is an ordinary least squares regression of r,, on 
Xp, SO 


Cy a BExp) Xx, = 0. 


ge, 
Hence (14) becomes 


var, [Tp] = varj[T pis 


M? m -1 pT T 
a-(1= | (M -1) Bey Xe Re 
Proof of Theorem 3 


The GREG estimator is invariant under linear invertible 
transformations of the auxiliary variables. Hence model (9) 
can be re-parameterised to give 


Ey ly] = 1X, + $2 (%, — X,) (15) 
or equivalently 
Ey ly] = 6"; 
where 
Lees 
gj 3 aS 
X; —Xg 
and 
f- hi | 
>, 


The parameters in model (15) are related to those in model 


(9) by $, = ¥, + Y, and , = 7>. 
From Definition 1, noting that 


boc U U, 
§E5; 


for the assumed design, the generalized regression estimator 
under model (15) is 


1 pa = > 76 2; 


ieU ies 


=f, + >) DY 6x, + 6,05 - ¥)} 


gcU, icU, 
Bad cad ax) (16) 
ges, ieU, 


However, Lieu, (X; — ¥g) = 0 for each g. Hence (16) 
becomes 


‘eesniienee 9 LD Heahie OBL Wes 


gel, iceU, ges, icU, 
~ T. aT = aT =i pat 
=1,+9, D cs pau aes 
geU, icU, ZES, ieU, 
r aT ix aT =f ae 
= at 8, De Xg1 = Spe Mei Xgl 
gcU, ges, 
A rae g A 
= T, + $, (Ty - Ty): (17) 


Notice that (17) does not include the estimator of $,. The 
least squares estimators 
>, 
are the solution of: 


my 7 C(); 7 $ z;) Zia) 


ies 


which is equivalent to: 


ies 


> 1; cl; — bX — 6, - zo), 2a ) anf 
i & 


By assumption, c, = a,N, so the first p elements of this 
equation are: 


a3 aT A 
Use Dy 2 m, a, N, x,y; ~ $)X¢ - 6,(%; et oy 


ges, icU, 
as -1 nes al fe aT : 
2 x, Meg Nex De Vi — Xe — o, (%; =X) 
&€5, icU, 


# ‘ha aT 
Wie S Tei a, Xg1 Wel — > %g1 — ,(Xp1 Xp1)} 


&€5, 


=f A 
0 rd > TN ol a, X 91 (Ver ag $1 X,1): 


§ ES) 


Hence 9, is a solution to (7). So the GREG estimator for 
model (9) is equal to 7,, provided that c, = a,N,. 
Mean-adjusted bootstrap for two-phase sampling 


Hiroshi Saigo 


Abstract 


Two-phase sampling is a useful design when the auxiliary variables are unavailable in advance. Variance estimation under 
this design, however, is complicated particularly when sampling fractions are high. This article addresses a simple bootstrap 
method for two-phase simple random sampling without replacement at each phase with high sampling fractions. It works for 
the estimation of distribution functions and quantiles since no rescaling is performed. The method can be extended to 
stratified two-phase sampling by independently repeating the proposed procedure in different strata. Variance estimation of 
some conventional estimators, such as the ratio and regression estimators, is studied for illustration. A simulation study is 
conducted to compare the proposed method with existing variance estimators for estimating distribution functions and quantiles. 


Key Words: Double Sampling; Resampling; Variance estimation. 


1. Introduction 


Two-phase sampling or double sampling is a powerful 
tool for efficient estimation in surveys. Usually, a large- 
scale first phase sample is taken where auxiliary variables, 
correlated with the characteristics of interest and relatively 
easily obtained, are observed. Then, a small-scale sub- 
sample is chosen from the first phase sample to measure the 
characteristics of interest that are harder to obtain. At the 
estimation stage, the auxiliary variables at the first phase are 
employed to obtain an efficient estimator. 

A closed-form sample variance formula for an estimator 
can be complicated or even unavailable under two-phase 
sampling. Consequently, resampling methods, such as the 
jackknife and bootstrap, are appealing for two-phase 
sampling. Rao and Sitter (1995) and Sitter (1997) studied 
the delete-1 jackknife approach to the ratio and regression 
estimators under two-phase sampling and found the method 
provides design-consistent variance estimation with desir- 
able conditional properties given the auxiliary variables. 

A weakness of the delete-1 jackknife is that it cannot 
handle quantile estimation. Moreover, it is not trivial how 
one can incorporate the finite population correction into the 
jackknife variance estimation under two-phase sampling 
(see Lee and Kim 2002 and Berger and Rao 2006). The 
bootstrap, on the other hand, eliminates these problems if 
properly formulated. 

Several bootstrap methods for two-phase sampling have 
been proposed and studied. Schreuder, Li and Scott (1987), 
Biemer and Atkinson (1993) and Sitter (1997) considered 
similar bootstrap methods which provide consistent variance 
estimation when sampling fractions are negligible. Rao and 
Sitter (1997) proposed a rescaling bootstrap for high 
sampling fractions. 


A disadvantage of the rescaling approach is that it cannot 
handle the estimation of distribution functions and quantiles. 
In this paper, we propose a mean-adjusted bootstrap for 
two-phase sampling that accommodates the estimation of 
distribution functions and quantiles. The method is simple 
and includes the existing ones for negligible sampling 
fractions as a special case. Recently, Kim, Navarro, and 
Fuller (2006) studied replication variance estimation with- 
out rescaling for two-phase sampling in a more generalized 
framework than that of this paper. Our method, however, is 
different in that it internally incorporates the finite popu- 
lation correction. 

This paper is organized as follows. Section 2 presents the 
mean-adjusted bootstrap for two-phase sampling. Section 3 
illustrates how the proposed method works for some 
conventional estimators. A simulation for estimating distri- 
bution functions and quantiles is conducted in Section 4. 
Section 5 discusses further applications of the mean- 
adjusted bootstrap. Concluding remarks are given in 
Section 6. 


2. Mean-adjusted bootstrap 


For notational simplicity, we assume there is only one 
stratum. To extend our method to stratified sampling, repeat 
the same procedure independently in different strata to 
obtain a bootstrap sample (see Rao and Sitter 1997, pages 
759-762). 

Let $P$ be the set of unit labels in a population of size $N$. 
Suppose a simple random sample without replacement 
(SRSWOR) of size $n_{A+B}$ from $P$ is taken and denote the 
sampled labels by $A + B$. The auxiliary variable (vector) $x_i$ 
is observed for $i \in A + B$. Then take a second phase 
SRSWOR of size $n_A < n_{A+B}$ from $A + B$ and denote the 
sampled labels by $A$. The characteristic (vector) $y_i$ is 
measured for $i \in A$. Let $B = (A + B) - A$, $n_B = n_{A+B} - n_A$, 
$y_A = \{y_i : i \in A\}$, $x_A = \{x_i : i \in A\}$, and $x_B = \{x_j : j \in B\}$. 
An approximately design-unbiased estimator of 
parameter $\theta$ is assumed to be written as $\hat\theta = t(y_A, x_A, x_B)$. 
Under the proposed method, a bootstrap sample is 
constructed as follows. 


1. Regard $A$ as an SRSWOR of size $n_A$ from $P$. 
Choose $n_A$ units from $A$ by a bootstrap method 
suitable for an SRSWOR of size $n_A$ from $P$. 
Denote the sampled labels by $A^*$. 


2. Regard $B$ as an SRSWOR of size $n_B$ from $P - A$ 
conditional on $A$ having been selected. Choose $n_B$ 
units from $B$ by a bootstrap method suitable for an 
SRSWOR of size $n_B$ from $P - A$. Denote the 
sampled labels by $B^*$. 


3. For $j \in B^*$, define the mean-adjustment as 

$$\tilde x_j^* = x_j + f_A(\bar x_A - \bar x_A^*)/(1 - f_A), \qquad (1)$$

with $\bar x_A = n_A^{-1}\sum_{i \in A} x_i$, $\bar x_A^* = n_A^{-1}\sum_{i \in A^*} x_i$, and $f_A = n_A/N$. 


4. Let $y_A^* = \{y_i : i \in A^*\}$, $x_A^* = \{x_i : i \in A^*\}$, and 
$\tilde x_B^* = \{\tilde x_j^* : j \in B^*\}$. The bootstrap analogue of $\hat\theta$ 
is given by $\hat\theta^* = t(y_A^*, x_A^*, \tilde x_B^*)$. 


For bootstrap methods for a finite population, see Shao 
and Tu (1995, Chapter 6). The Bernoulli Bootstrap (BBE) 
proposed by Funaoka, Saigo, Sitter and Toida (2006) is 
appropriate for our method for a reason specified 
later. To obtain a bootstrap sample $A^*$ in the BBE, we 
conduct random replacement for each $i$ in $A$: keep 
$(x_i, y_i)$ in the bootstrap sample with probability 
$p = \{1 - (1 - n_A^{-1})^{-1}(1 - f_A)\}^{1/2}$ or replace it with one 
randomly selected from $A$. For the case where $p \notin [0, 1]$, 
see Funaoka et al. (2006). 
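The form of the keeping probability can be checked with a short calculation. The sketch below is a reconstruction, not a quotation from Funaoka et al. (2006); it assumes that a replaced unit is drawn uniformly from all $n_A$ units of $A$ (possibly itself) and that the Bernoulli decisions are independent across units. Under those assumptions, $p$ is the value that makes the bootstrap variance of a sample mean carry the finite population correction $1 - f_A$.

```latex
% Assumption: each unit is kept with probability p, otherwise replaced by a
% unit drawn uniformly from A; decisions are independent across units.
\begin{align*}
V_*(\bar{x}_A^*)
  &= (1 - p^2)\,\frac{n_A - 1}{n_A}\,\frac{s_{xA}^2}{n_A},
  \qquad s_{xA}^2 = (n_A - 1)^{-1}\sum_{i \in A}(x_i - \bar{x}_A)^2, \\
(1 - p^2)\,\frac{n_A - 1}{n_A} = 1 - f_A
  &\;\Longrightarrow\;
  p = \{1 - (1 - n_A^{-1})^{-1}(1 - f_A)\}^{1/2}.
\end{align*}
% p^2 >= 0 requires f_A >= 1/n_A, which is why p can fall outside [0, 1]
% when the sampling fraction is very small.
```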

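To estimate the variance of $\hat\theta$, repeat steps 1-4 a large 
number of times $K$ and use 

$$v_{boot}(\hat\theta) = K^{-1}\sum_{k=1}^{K}(\hat\theta^*_{(k)} - \hat\theta^*_{(\cdot)})^2, \qquad (2)$$

where $\hat\theta^*_{(k)}$ is the value of $\hat\theta^*$ in the $k$th bootstrap sample 
and $\hat\theta^*_{(\cdot)} = K^{-1}\sum_{k=1}^{K}\hat\theta^*_{(k)}$. 

When $f_A$ is negligible, the mean adjustment (1) is 
unnecessary. The above method then reduces for large $n_A$ 
to that by Schreuder et al. (1987) and Sitter (1997). 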
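Steps 1-4 and the variance formula (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function names (`bbe_resample`, `mean_adjusted_bootstrap_variance`, `ratio_estimator`), the toy population, and the choice of the ratio estimator are all hypothetical, $x$ is taken to be scalar, and the divisor in (2) is taken to be $K$.

```python
import math
import random


def bbe_resample(labels, fraction, rng):
    """Bernoulli bootstrap (BBE): keep each unit with probability p,
    otherwise replace it with a unit drawn at random from `labels`."""
    n = len(labels)
    p_squared = 1.0 - (1.0 - fraction) / (1.0 - 1.0 / n)
    if p_squared < 0.0:
        raise ValueError("p is outside [0, 1]; see Funaoka et al. (2006)")
    p = math.sqrt(p_squared)
    return [i if rng.random() < p else rng.choice(labels) for i in labels]


def mean_adjusted_bootstrap_variance(x, y, A, B, N, estimator, K=200, seed=0):
    """Repeat steps 1-4 K times and return the variance estimate (2)."""
    rng = random.Random(seed)
    n_A, n_B = len(A), len(B)
    f_A = n_A / N
    xbar_A = sum(x[i] for i in A) / n_A
    replicates = []
    for _ in range(K):
        # Step 1: A treated as an SRSWOR of size n_A from P.
        A_star = bbe_resample(A, f_A, rng)
        # Step 2: B treated as an SRSWOR of size n_B from P - A.
        B_star = bbe_resample(B, n_B / (N - n_A), rng)
        # Step 3: mean adjustment (1), applied to every unit of B*.
        xbar_A_star = sum(x[i] for i in A_star) / n_A
        shift = f_A * (xbar_A - xbar_A_star) / (1.0 - f_A)
        # Step 4: bootstrap analogue of the estimator.
        replicates.append(estimator([y[i] for i in A_star],
                                    [x[i] for i in A_star],
                                    [x[j] + shift for j in B_star]))
    centre = sum(replicates) / K
    return sum((t - centre) ** 2 for t in replicates) / K


def ratio_estimator(y_A, x_A, x_B):
    """(ybar_A / xbar_A) times the first-phase mean of x."""
    xbar_AB = (sum(x_A) + sum(x_B)) / (len(x_A) + len(x_B))
    return (sum(y_A) / sum(x_A)) * xbar_AB


# Toy population (hypothetical); in practice y is known only for units in A.
rng = random.Random(12345)
N = 50
x = [rng.uniform(1.0, 5.0) for _ in range(N)]
y = [2.0 * xi + rng.gauss(0.0, 0.3) for xi in x]
first_phase = rng.sample(range(N), 20)
A, B = first_phase[:8], first_phase[8:]
v_boot = mean_adjusted_bootstrap_variance(x, y, A, B, N, ratio_estimator)
```

Because the BBE keeps the bootstrap sample sizes equal to $n_A$ and $n_B$, the same shift is valid for every replicate; a resampling scheme that changed the sample size would need a different adjustment.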

The proposed bootstrap method is motivated by the 
following two observations. First, let sampling schemes I 
and II be $[P \to A + B,\ A + B \to A]$ and $[P \to A,\ P - A \to B]$, 
respectively, where $\to$ means "the right hand 
side is an SRSWOR from the left hand side." Then, I and II 
implement the identical design. In fact, the design 
probability assigned to a particular sample 
$\{i = (i_1, \ldots, i_{n_A}) \in A,\ j = (j_1, \ldots, j_{n_B}) \in B\}$ is 
$\Pr\{i \in A, j \in B\} = \binom{N}{n_{A+B}}^{-1}\binom{n_{A+B}}{n_A}^{-1} = n_A!\,n_B!\,(N - n_{A+B})!/N!$ in I, 
while it is $\Pr\{i \in A, j \in B\} = \binom{N}{n_A}^{-1}\binom{N - n_A}{n_B}^{-1} = n_A!\,n_B!\,(N - n_{A+B})!/N!$ in II. 
Obviously, the sampling distribution of an estimator under repeated sampling 
depends on the sampling design. So, it is a matter of 
convenience to assume II is carried out even when I is 
employed. 
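The equality of the two design probabilities is easy to confirm numerically. The following sketch uses exact rational arithmetic; the function names and the test values of $N$, $n_A$ and $n_B$ are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction
from math import comb, factorial


def prob_scheme_one(N, n_A, n_B):
    """Scheme I: SRSWOR of size n_A + n_B from P, then SRSWOR of size n_A from it."""
    n_AB = n_A + n_B
    return Fraction(1, comb(N, n_AB)) * Fraction(1, comb(n_AB, n_A))


def prob_scheme_two(N, n_A, n_B):
    """Scheme II: SRSWOR of size n_A from P, then SRSWOR of size n_B from P - A."""
    return Fraction(1, comb(N, n_A)) * Fraction(1, comb(N - n_A, n_B))


def closed_form(N, n_A, n_B):
    """Common value n_A! n_B! (N - n_A - n_B)! / N!."""
    return Fraction(factorial(n_A) * factorial(n_B) * factorial(N - n_A - n_B),
                    factorial(N))


same = prob_scheme_one(20, 3, 5) == prob_scheme_two(20, 3, 5) == closed_form(20, 3, 5)
```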

Second, to motivate the mean adjustment (1), observe 
that the mean of $x$ of the set $P - A$, or the conditional 
expectation of $\bar x_B$ under repeated sampling given $A$, is 
$\bar X_{P-A} = (\bar X - f_A \bar x_A)/(1 - f_A)$. The bootstrap analogue of 
$\bar X_{P-A}$ is given by $\bar X^*_{P-A} = (\bar X - f_A \bar x_A^*)/(1 - f_A)$. So, 
equation (1) amounts to $\tilde x_j^* = x_j + (\bar X^*_{P-A} - \bar X_{P-A})$, a mean 
adjustment similar to that proposed by Rao and Shao (1992) 
in the context of hot deck imputation under the uniform 
response mechanism. This mean adjustment ensures 
appropriate correlations between $x$ in $A$ and $x$ in $B$ 
required for consistent variance estimation with high 
sampling fractions (see Rao and Sitter 1997, page 760). 
Note that the condition $n_{A^*} = n_A$ or $f_{A^*} = f_A$ is essential 
for cancelling out $\bar X$ in the mean adjustment. Therefore, the 
mean-adjusted bootstrap requires a bootstrap method for 
SRSWOR which retains the original sample size, such as 
the BBE. 
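The cancellation of the unknown population mean can be written out explicitly. The lines below are a worked sketch; the symbol $f_{A^*}$ for the sampling fraction of a bootstrap sample of a different size is introduced only for illustration.

```latex
\begin{align*}
\bar{X}^*_{P-A} - \bar{X}_{P-A}
  &= \frac{\bar{X} - f_A \bar{x}_A^*}{1 - f_A}
     - \frac{\bar{X} - f_A \bar{x}_A}{1 - f_A}
   = \frac{f_A(\bar{x}_A - \bar{x}_A^*)}{1 - f_A}. \\
\intertext{If the bootstrap sample size differed, so that $f_{A^*} \neq f_A$, the difference would be}
\frac{\bar{X} - f_{A^*} \bar{x}_A^*}{1 - f_{A^*}}
     - \frac{\bar{X} - f_A \bar{x}_A}{1 - f_A}
  &= \bar{X}\left(\frac{1}{1 - f_{A^*}} - \frac{1}{1 - f_A}\right)
     - \frac{f_{A^*} \bar{x}_A^*}{1 - f_{A^*}} + \frac{f_A \bar{x}_A}{1 - f_A},
\end{align*}
% which still involves the unknown population mean \bar{X}.
```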

It is shown in Appendix A that the proposed bootstrap 
method provides design-consistent variance estimation for 
the class of estimators studied by Rao and Sitter (1997). 
Since no rescaling is performed, the method also works for 
estimation of distribution functions. Under some regularity 
conditions for the population distribution function, it 
provides design-consistent variance estimates for quantiles. 




3. Illustrations 
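3.1 Ratio estimator 


To illustrate, let us first consider the ratio estimator 
$\bar y_r = r_A \bar x_{A+B}$, where $r_A = \bar y_A/\bar x_A$, $w_A = n_A/n_{A+B}$, and 
$\bar x_{A+B} = w_A \bar x_A + (1 - w_A)\bar x_B$. Let $\bar y_r^* = (\bar y_A^*/\bar x_A^*)\{w_A \bar x_A^* + 
(1 - w_A)\tilde{\bar x}_B^*\}$, the bootstrap analogue of $\bar y_r$. Using the 
results in Appendix A with $h(\bar y_A, \bar x_A, \bar x_B) = (\bar y_A/\bar x_A) 
\{w_A \bar x_A + (1 - w_A)\bar x_B\}$, we may approximate the variance of $\bar y_r^*$ 
under the proposed bootstrap method, $V_*(\bar y_r^*)$, by 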


wie aes ome ee 
VAD YS (yl el XO) Oa fa) SG, 
Ny 
pag an eee 
Er (Kyle PRY) On Sass) V4 Stes 
ih 
O fae) 2 (eta) hy Eu mai ry ie (3) 
Nase 7 ae ie ; “5 4) 
where $s_{dA}^2 = (n_A - 1)^{-1}\sum_{i \in A}(y_i - r_A x_i)^2$, 
$s_{dxA} = (n_A - 1)^{-1}\sum_{i \in A}(y_i - r_A x_i)(x_i - \bar x_A)$, 
$s_{xA}^2 = (n_A - 1)^{-1}\sum_{i \in A}(x_i - \bar x_A)^2$, 
and $s_{xB}^2 = (n_B - 1)^{-1}\sum_{i \in B}(x_i - \bar x_B)^2$. The right hand side of 
(3) can be described as a "bootstrap-linearization" variance 
estimator. We denote it by $v_{BL}(\bar y_r)$. Note that $v_{BL}(\bar y_r)$ is 
almost identical to the jackknife-linearization variance 
estimator by Rao and Sitter (1995), 


- a 
ne eRe ae $3, 
aie ee z,) boss) ~ Save) i SM 
N4.B 
iy, 

C~ fas8) 8 “os Soon: (4) 
Nasp 

where $s_{xA+B}^2 = (n_{A+B} - 1)^{-1}\sum_{i \in A+B}(x_i - \bar x_{A+B})^2$, which 
agrees with equation 4.8 of Demnati and Rao (2004), page 
25. Since they are close to $v_{JL}(\bar y_r)$, $V_*(\bar y_r^*)$, its Monte Carlo 
approximation $v_{boot}(\bar y_r)$ and $v_{BL}(\bar y_r)$ should perform well 
not only unconditionally but conditionally on $(\bar x_{A+B}/\bar x_A)$ as 
well. It is interesting to note that Taylor linearization in 
deriving $v_{BL}(\bar y_r)$ is performed around the sample means, 
not the population means (see the comment made by 
Demnati and Rao 2004, page 21). 


3.2 Regression estimator 


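We next consider the regression estimator. The estimator 
of the population mean is $\bar y_{lr} = \bar y_A + b_A(\bar x_{A+B} - \bar x_A) = 
\bar y_A + (1 - w_A) b_A(\bar x_B - \bar x_A)$, where $b_A = s_{xyA}/s_{xA}^2$ with 
$s_{xyA} = (n_A - 1)^{-1}\sum_{i \in A}(x_i - \bar x_A)(y_i - \bar y_A)$. Let 
$\bar y_{lr}^* = \bar y_A^* + (1 - w_A) b_A^*(\tilde{\bar x}_B^* - \bar x_A^*)$. 
Using the results in Appendix A (see 
also Appendix B), we have 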
V.(y;,) = asic Ly) Meo 


nN4 


re (l= fase) » 2| (wa = Sa) Jy) so earl Zee 

N4a.p Lf) A 

is +2z, (J ti n 
Ng N4 
+ 2z, Oa Sus) b,M, 
Nass 

2 1— A 

+42; a) a,b, Xx, Sa (5) 


A 


where 24 =14(X4.9 —X%4) /t(n4 - 1) Su Ly Sieh) 
OP (raapy re? Pe) yy 10 OG, HX, “and 
a,= ¥,—5,x, We call the right hand side of (5) a 
bootstrap-linearization variance estimator of ¥,, and denote 
it by vz, (¥,). The jackknife-linearization variance 
estimator for 7, (Sitter 1997, page 781) is 


63 
& (bam!) : 
Val Vy.) = oo Mo2 eee bi cae 
A A+B 
mney (agere ia ple Py (x, —-X,)e 
Ny ieA (l-c,)° Ny ieA (1 ec; ) 
i 22 4b, (%) = X44); ~ Kaeo )E (6) 
Ny(Ng.p “ay ieA Wa c) 


where c, = n;' + (x, —¥,) /{(n, -1)S82,}, the leverage 
values. From (5) and (6), Vico (¥i-)> Yar (¥,-) and vy (¥;,) 
perform in a similar fashion conditionally provided that 
tarzg =, n, 1s large enough for all c, to be nearly zero 
and the last term on the right hand side of (5) is negligible. 


3.3 Estimation of distribution functions 


As an example, let us take the model-calibrated pseudo- 
empirical maximum likelihood estimator (ME) under two- 
phase sampling proposed by Wu and Luan (2003) defined 
by 

$$\hat F_{ME}(t) = \sum_{i \in A} \hat p_i I(y_i \le t), \qquad (7)$$

where $\hat p_i$ maximizes the pseudo-likelihood function 
$\hat l(p) = \sum_{i \in A}(N/n_A)\log p_i$ subject to (a) $\sum_{i \in A} p_i = 1$ 
$(0 < p_i < 1)$; and (b) $\sum_{i \in A} p_i g_i = n_{A+B}^{-1}\sum_{i \in A+B} g_i$, where 
$g_i = g(x_i, t) = P(y \le t \mid x_i)$ under a certain working 
model. For example, we may assume $\log(g_i/(1 - g_i)) = x_i'\theta$ 
with variance function $V(g) = g(1 - g)$. Chen, Sitter 
and Wu (2002) showed a simple algorithm for computing 
$\hat p_i$. It can be shown (see Wu and Luan 2003) that under the 
two-phase sampling considered in this paper, 

$$\hat F_{ME}(t) = n_A^{-1}\sum_{i \in A} I(y_i \le t) + \Big\{n_{A+B}^{-1}\sum_{i \in A+B} g_i - n_A^{-1}\sum_{i \in A} g_i\Big\} B + o_p(n_A^{-1/2}),$$

where $B = \sum_P (g_i - \bar g) I(y_i \le t)/\sum_P (g_i - \bar g)^2$ with 
$\bar g = N^{-1}\sum_P g_i$. Note that this equation is not used in 
estimation, but it shows that the variance of $\hat F_{ME}(t)$ can be 
estimated by the mean-adjusted bootstrap since $\hat F_{ME}(t)$ is 
approximated by a regression-type estimator. 


3.4 Quantile estimation 


Quantile estimation can be obtained by directly inverting 
$\hat F(t)$ by $\hat F^{-1}(\alpha) = \inf\{t : \hat F(t) \ge \alpha\}$ for some $\alpha \in (0, 1)$. 
For example, if (7) is used, then a quantile estimate is given 
by $y_{(k)}$, where $y_{(k)}$ is the $k$th order statistic of $y$ such that 
$\sum_{i=1}^{k-1}\hat p_{(i)} < \alpha \le \sum_{i=1}^{k}\hat p_{(i)}$ (Chen and Wu 2002). 
Under some conditions specified in Chen and Wu (2002), a 
Bahadur-type representation for $\hat F_{ME}^{-1}(\alpha)$ can be established. 
Thus the mean-adjusted bootstrap variance estimator for 
$\hat F_{ME}^{-1}(\alpha)$ is design-consistent. Note that no closed form 
variance estimator for $\hat F_{ME}^{-1}(\alpha)$ is available, but a consistent 
variance estimator based on Woodruff's interval estimation 
(Woodruff 1952) can be applied. 
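The inversion step can be illustrated with a small function that returns the order statistic at which the cumulative weights first reach $\alpha$. This is a sketch only: the name `weighted_quantile` and the example weights are hypothetical, and the weights are assumed to be positive and to sum to one, as the $\hat p_i$ in (7) do.

```python
def weighted_quantile(y, p, alpha):
    """Return inf{t : F(t) >= alpha} for F(t) = sum of p_i over y_i <= t."""
    pairs = sorted(zip(y, p))          # order the y values, carrying weights
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= alpha:        # first order statistic reaching alpha
            return value
    return pairs[-1][0]                # guard against rounding in the weights


median_equal = weighted_quantile([3.0, 1.0, 4.0, 2.0], [0.25] * 4, 0.5)
upper_unequal = weighted_quantile([10.0, 20.0, 30.0], [0.2, 0.5, 0.3], 0.75)
```

With equal weights $1/n_A$ this reduces to the usual sample quantile based on $\{y_i : i \in A\}$.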


4. Simulation 


4.1 Population and sampling 


A simulation study was conducted to examine the mean- 
adjusted bootstrap variance estimator for the estimators in 
Section 3. We report here the results for estimating 
distribution functions and quantiles. The results for the ratio 
and regression estimators are available from the author upon 
request. 

First, the auxiliary variable $x$ for a finite population $P$ 
of size $N = 2,000$ was generated as Gamma(1, 1). The 
characteristic variable $y$ was then generated by $y_i = 
x_i + \sqrt{x_i}\, v_i$, where $v_i \sim N(0, 0.5^2)$. An SRSWOR $A + B$ 
of size $n_{A+B} = 800$ was taken from the population and then 
an SRSWOR $A$ of size $n_A = 200$ was selected from 
$A + B$. The population was fixed throughout all simulation 
runs since we focus on design-based repeated-sampling 
properties. 
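A population and a two-phase sample of this kind could be generated as follows. The sketch uses only the Python standard library; the function names and the seeds are arbitrary assumptions, so it will not reproduce the particular population used in the study.

```python
import math
import random


def make_population(N=2000, seed=2007):
    """x ~ Gamma(1, 1); y = x + sqrt(x) * v with v ~ N(0, 0.5^2)."""
    rng = random.Random(seed)
    x = [rng.gammavariate(1.0, 1.0) for _ in range(N)]
    y = [xi + math.sqrt(xi) * rng.gauss(0.0, 0.5) for xi in x]
    return x, y


def draw_two_phase(N, n_AB, n_A, rng):
    """First-phase SRSWOR of size n_AB, then second-phase SRSWOR of size n_A."""
    first_phase = rng.sample(range(N), n_AB)
    A = rng.sample(first_phase, n_A)
    chosen = set(A)
    B = [i for i in first_phase if i not in chosen]
    return A, B


x, y = make_population()            # fixed once, as in a design-based study
A, B = draw_two_phase(len(x), 800, 200, random.Random(1))
```

Only the call to `draw_two_phase` would be repeated across simulation runs, since the population is held fixed.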


4.2 Estimation of distribution functions 


For the estimation of distribution functions, we took 
$\hat F_{ME}(t)$ as an example. Other estimators, e.g., Chambers 
and Dunstan (1986) and Rao, Kovar and Mantel (1990), can 
be handled similarly when an estimator is approximately 
design-unbiased. The working model for $g$ in $\hat F_{ME}(t)$ was 
assumed to be logit with binomial variance. The bootstrap 
variance estimator $v_{boot}(\hat F_{ME}(t))$ was calculated with $K = 
200$. The BBE was used in constructing a bootstrap sample. 
The total simulation runs were $M = 5,000$ while the true 
MSE of $\hat F_{ME}(t)$ at a given $t$ was estimated by 50,000 runs. 

We compared $v_{boot}(\hat F_{ME}(t))$ with three variance 
estimators: Wu and Luan's (2003) analytical estimator, the 
standard delete-1 jackknife and an ad hoc fpc-adjusted 
delete-1 jackknife. Wu and Luan's (2003) estimator is 


e -1 -1) @2 -! a ao 
Va (Fue (2) = Otaee = NDS, A sep) Spd» 
where the two S? components are estimated respectively 
by 
ad 2 
ply ype oe u, 
POM GALE RE: 


] ; 
a ae >) Uj; be 
nN, (n, P l)pshi jéA 
wheres? = {n,(my—-D}"Licgijen Vp and Bp = 
Veen, vl Leper withauy andjrg specified.as 
follows: For Ss v, =U; - I and u, = (8; -¢,) with 
I,=I(y, $t) and g, = &(x,, ft) estimated in A; For 
Sis ¥, = (Dis Dp? and up = 8) (aris ie ep 
with =D, = 1, — 8B, B= Lies 1)(8; — 84)! Lies (8) - 
&,) and 8, = 14 Lies & 
The standard delete-1 jackknife formula is given by 

$$v_J(\hat\theta) = \frac{n_{A+B} - 1}{n_{A+B}}\sum_{j \in A+B}(\hat\theta_{(-j)} - \hat\theta_{(\cdot)})^2,$$

where $\hat\theta = \hat F_{ME}(t)$, $\hat\theta_{(-j)}$ is the $j$th jackknife pseudo- 
estimate and $\hat\theta_{(\cdot)} = n_{A+B}^{-1}\sum_{j \in A+B}\hat\theta_{(-j)}$. Note that for $j \in A$, 
both $y_j$ and $x_j$ are deleted from the sample while for 
$j \in B$, only $x_j$ is deleted (see Rao and Sitter 1995 and 
Sitter 1997). The ad hoc fpc-adjusted formula is 
$(1 - f_{A+B})\, v_J(\hat F_{ME}(t))$. 

Table 1 shows the relative bias (%Bias) and the 
coefficient of variation (CV) of the four variance estimators 
for $\hat F_{ME}(t_\alpha)$ ($\alpha = 0.10, 0.25, 0.50, 0.75, 0.90$), where 
$F(t_\alpha) = \alpha$. Here, %Bias and CV were calculated as 
%Bias $= 100 \times (M^{-1}\sum_{m=1}^{M} v^{(m)} - MSE)/MSE$ and CV $= 
[M^{-1}\sum_{m=1}^{M}(v^{(m)} - MSE)^2]^{1/2}/MSE$, respectively, where 
$v^{(m)}$ is a variance estimate in the $m$th simulation run. Table 
1 demonstrates that $v_J(\hat F_{ME}(t))$ is biased upward since the 
sampling fractions are not negligible, that the fpc-adjusted jackknife is 
biased downward since the ad hoc adjustment factor 
$(1 - f_{A+B})$ is too small, and that both $v_a(\hat F_{ME}(t))$ and 
$v_{boot}(\hat F_{ME}(t))$ are approximately unbiased although the 
latter is slightly more unstable, as is typical for a resampling 
method. 
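The two summary measures translate directly into code. In the sketch below the function names and the toy inputs are illustrative assumptions; `v` stands for the list of $M$ variance estimates and `mse` for the separately simulated true MSE.

```python
import math


def percent_bias(v, mse):
    """100 * (mean of the variance estimates - MSE) / MSE."""
    return 100.0 * (sum(v) / len(v) - mse) / mse


def coefficient_of_variation(v, mse):
    """Root mean squared deviation of the estimates from the MSE, over the MSE."""
    return math.sqrt(sum((vi - mse) ** 2 for vi in v) / len(v)) / mse


bias_example = percent_bias([3.0, 3.0], 2.0)
cv_example = coefficient_of_variation([1.0, 3.0], 2.0)
```

Note that CV as defined here measures spread around the true MSE rather than around the mean of the estimates, so a biased estimator is penalised on both measures.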


Table 1 Variance estimation for the pseudo-empirical MLE $\hat F_{ME}(t_\alpha)$ 
a 
Estimator 0.10 0.25 0.50 0.75 0.90 
Veco (Fea). YeBiasnn = 027, =0:22 4 69064. O85. 2.73 
CV “O19 "O14" 0.148015 "0.24 


Va(F me (to) Bias, -429 -2.03 O42 los 326 
CV—017, 0:1 LAN 009-——O.1T + 0.19 
vy (Fue(ta)) %Bias 14.24 17.29 22.98 23.80 24.97 
CV 40:24 021)" 10254(02n O36 
Viel Mele)  %Bias -31.45 -29.63 -26.21 -25.72 -25.02 


CV" 033" 030" Cal OT ame 


Paralleling Royall and Cumberland (1981a, 1981b), we 
ordered the $M = 5,000$ simulated samples on the values of 
$\bar x_{A+B} - \bar x_A$, classified them into 20 consecutive groups of 
$G = 250$ in each of which the simulated conditional 
MSE ($MSE_c$) and conditional mean of $v$ ($E_c(v)$) were 
computed. Figure 1 shows $MSE_c$ and $E_c(v)$ plotted 
against the group averages of $\bar x_{A+B} - \bar x_A$ for $t_{0.10}$ and $t_{0.90}$. 
It is seen that both $v_a(\hat F_{ME}(t))$ and $v_{boot}(\hat F_{ME}(t))$ behave 
similarly conditioned on $\bar x_{A+B} - \bar x_A$. The jackknife variance 
estimators, $v_J(\hat F_{ME}(t))$ and its fpc-adjusted version, though biased, 
track a trend in $MSE_c$. 


4.3 Quantile estimation 


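By directly inverting $\hat F_{ME}(t)$, we estimated the $\alpha$ 
quantile. To obtain $\hat p_i$ for $\hat F_{ME}(t)$, we fixed $t$ at $\hat t_\alpha$, where 
$\hat t_\alpha = \inf\{t : n_A^{-1}\sum_{i \in A} I(y_i \le t) \ge \alpha\}$, an estimator using only 
$\{y_i : i \in A\}$. For variance estimation, $K = 1,000$ bootstrap 
samples were created. For comparison, we also computed 
the Woodruff variance estimator (Woodruff 1952 and Shao 
and Tu 1995, page 238), 

$$v_W(\hat F_{ME}^{-1}(\alpha)) = \left\{\frac{\hat F_{ME}^{-1}(\alpha + \zeta_{1-\kappa/2}\hat\sigma) - \hat F_{ME}^{-1}(\alpha - \zeta_{1-\kappa/2}\hat\sigma)}{2\zeta_{1-\kappa/2}}\right\}^2,$$

where $\hat\sigma^2 = v(\hat F_{ME}(t))$ with $t = \hat F_{ME}^{-1}(\alpha)$ and $\zeta_{1-\kappa/2}$ is the 
$(1 - \kappa/2)$ quantile of $N(0, 1)$. We let $\kappa = 0.05$ although 
the best choice of $\kappa$ is unknown. The performance 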
M = 5,000 runs while the true MSE was estimated through 
50,000 simulation runs. 

Table 2 summarizes the results for quantile estimation. It 
demonstrates that the mean-adjusted bootstrap has an 
upward bias in estimating $V(\hat F_{ME}^{-1}(\alpha))$ while the bias in the 
Woodruff variance estimator is negligible. 


Table 2 Variance estimation for quantiles 

                                                      α
Estimator                                  0.10   0.25   0.50   0.75   0.90
$v_{boot}(\hat F_{ME}^{-1}(\alpha))$   %Bias    6.27  14.32  10.05  10.02  10.28
                                 CV       0.53   0.53   0.51   0.52   0.61
$v_W(\hat F_{ME}^{-1}(\alpha))$        %Bias    1.64   3.75   2.92   0.70  -3.67
                                 CV       0.50   0.45   0.45   0.46   0.52

Figure 2 shows conditional properties of $v_{boot}(\hat F_{ME}^{-1}(\alpha))$ 
and $v_W(\hat F_{ME}^{-1}(\alpha))$ for $\alpha = 0.10, 0.90$. We see that both 
$v_{boot}(\hat F_{ME}^{-1}(\alpha))$ and $v_W(\hat F_{ME}^{-1}(\alpha))$ track $MSE_c$ similarly 
although the former uniformly possesses an upward bias. 


Figure 1 $MSE_c$ and $E_c(v)$ for $\hat F_{ME}(t_\alpha)$ (vertical axis: $MSE_c$ and $E_c(v)$; horizontal axis: mean($x_{A+B}$) - mean($x_A$); panel (b): $\hat F_{ME}(t_{0.90})$) 

Figure 2 $MSE_c$ and $E_c(v)$ for quantile estimation 
5. Further remarks 
5.1 Stratified two-phase sampling 


Suppose a population is to be stratified into $H$ strata but 
no information for stratification is available. A possible 
solution for this situation is to first obtain an SRSWOR of 
size $n'$ from the population, observe auxiliary variables 
including the ones for stratification, stratify the sample into 
$H$ strata, and in each stratum take an SRSWOR of size $n_h$ 
from $n'_h$ units belonging to stratum $h$ in the sample. See, 
for example, Cochran (1977, section 12.2) for details. 

Let $N_h$ be the size of stratum $h$ in the population. 
Conditioned on $n'_h > 0$, the first-phase sampling in stratum 
$h$ described above is equivalent to simple random sampling 
without replacement of size $n'_h$ in stratum $h$ independent 
across strata. Thus, given $n'_h$ $(h = 1, \ldots, H)$, the mean- 
adjusted bootstrap can be applied independently in different 
strata to obtain a bootstrap sample. When $N_h$ is unknown, 
as is usually the case for stratified two-phase sampling, an 
unbiased estimator $\hat N_h = N(n'_h/n')$ can be used in the mean- 
adjusted bootstrap. In this case, the sampling fraction $n'/N$ 
is used commonly throughout all the strata. 

Note, however, that the present discussion is legitimate 
for estimates conditioned on the first phase sample sizes. 
Variance due to the variable $n'_h$ may be large. For 
unconditional variance estimation, see Kim et al. (2006). 
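The stratum size estimator, and the fact that it implies one common first-phase sampling fraction, can be seen in a short sketch. The function name and the counts below are made up for illustration.

```python
def estimated_stratum_sizes(N, first_phase_counts):
    """N_hat_h = N * n'_h / n', where n' is the total first-phase sample size."""
    n_prime = sum(first_phase_counts)
    return [N * n_h / n_prime for n_h in first_phase_counts]


counts = [50, 30, 20]                      # hypothetical n'_h for H = 3 strata
sizes = estimated_stratum_sizes(1000, counts)
# Implied first-phase fraction in every stratum: n'_h / N_hat_h = n' / N.
fractions = [n_h / size for n_h, size in zip(counts, sizes)]
```

Each entry of `fractions` equals $n'/N$, which is the value the mean-adjusted bootstrap would use in place of the unknown stratum sampling fraction.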


5.2 Non-response 


The above comment applies to imputed survey data 
under the uniform response mechanism. Let us suppose that 
a population is stratified into $S_h$ $(h = 1, \ldots, H)$ where simple 
random sampling without replacement is undertaken 
independently. A sample is divided into imputation classes 
$C_l$ $(l = 1, \ldots, L)$ in each of which the response rate is 
assumed to be uniform and imputation is performed. An 
imputation class may cut across strata. We also assume that 
the imputation class to which a sampled unit belongs is 
correctly identified before imputation. Let us denote the 
numbers of sampled units and respondents in $S_h \cap C_l$ by 
$n_{hl}$ and $r_{hl}$, respectively. Then, it is seen that given $n_{hl}$ and 
$r_{hl}$, the corresponding design in $S_h \cap C_l$ is the same as the 
one discussed in this paper if we regard the $n_{hl}$ units and 
$r_{hl}$ respondents as $A + B$ and $A$, respectively. Therefore, 
the mean-adjusted bootstrap can be conducted 
independently in different $S_h \cap C_l$ $(h = 1, \ldots, H;\ l = 
1, \ldots, L)$. The size of $S_h \cap C_l$, denoted by $N_{hl}$, can be 
estimated by $\hat N_{hl} = N_h(n_{hl}/n_h)$. Note that this is a boot- 
strap method conditioned on the number of respondents. 


6. Conclusion 


In this paper, we have proposed the mean-adjusted 
bootstrap for two-phase sampling. The method requires a 
simple mean adjustment and can handle the estimation of 
distribution functions and quantiles because it requires no 
rescaling. The Taylor series expansion shows that the 
method has desirable conditional properties for the ratio and 
regression estimators. A simulation study demonstrates that 
it also has similar conditional properties in estimating 
distribution functions and quantiles. An extension to 
stratified two-phase sampling is straightforward. Conditioned on 
the first phase sample sizes, the method can handle stratified 
two-phase sampling and imputation under the uniform 
response mechanism. We are currently investigating an 
extension of the proposed method to more generalized 
multi-phase sampling designs. 
Appendix A 


In this appendix, we show that the proposed bootstrap 
method provides consistent variance estimates for a class of 
estimators considered by Rao and Sitter (1997). We use the 
same setting as in Rao and Sitter (1997) with slightly 
different notation. For simplicity, we assume there exists 
only one stratum, but an extension to stratified two-phase 
sampling is straightforward. 

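Consider a class of estimators, $\hat\theta = h(\bar y_A, \bar x_A, \bar x_B)$, of a 
population parameter $\theta = h(\bar Y, \bar X, \bar X)$, where $\bar Y$ and $\bar X$ 
are the population means of vectors $y$ and $x$, i.e., 
$\bar Y = N^{-1}\sum_{i \in P} y_i$ and $\bar X = N^{-1}\sum_{i \in P} x_i$. Here, $x$ is observed 
in the first phase sample $A + B$ whereas $y$ is measured 
only in the second phase sample $A$. The sample means 
$(\bar y_A, \bar x_A)$ and $\bar x_B$ are calculated in $A$ and $B$, 
respectively, i.e., $\bar y_A = n_A^{-1}\sum_{i \in A} y_i$, $\bar x_A = n_A^{-1}\sum_{i \in A} x_i$ and 
$\bar x_B = n_B^{-1}\sum_{i \in B} x_i$. 

By a Taylor expansion, we have 

$$\hat\theta = \theta + \nabla h'(\Delta\bar y_A', \Delta\bar x_A', \Delta\bar x_B')' + o_p(n_A^{-1/2}),$$

where $\nabla h$ is the gradient vector of $h$ evaluated at 
$(\bar Y, \bar X, \bar X)$, $\Delta\bar y_A = \bar y_A - \bar Y$, $\Delta\bar x_A = \bar x_A - \bar X$, $\Delta\bar x_B = \bar x_B - \bar X$, 
and $'$ means a transposed matrix (see equation 33.7 of 
Rao and Sitter 1997, page 757 and the required conditions 
therein). Then, the variance of $\hat\theta = h(\bar y_A, \bar x_A, \bar x_B)$ is 
approximated by 

$$V(\hat\theta) \approx \nabla h'\, \Sigma_{(\bar y_A, \bar x_A, \bar x_B)}\, \nabla h,$$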
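where $\Sigma_{(\bar y_A, \bar x_A, \bar x_B)}$ is the variance-covariance matrix of 
$(\bar y_A', \bar x_A', \bar x_B')'$ under repeated two-phase sampling. Because 
$A$ and $B$ are SRSWOR's of size $n_A$ and $n_B$ from the 
population $P$, respectively, we see that $\Sigma_{(\bar y_A, \bar x_A)} = 
(1 - f_A) S^2_{(y,x)}/n_A$ and $\Sigma_{\bar x_B} = (1 - f_B) S_x^2/n_B$, where $S_u^2 = 
(N - 1)^{-1}\sum_{i \in P}(u_i - \bar U)(u_i - \bar U)'$ is the population variance 
of $u = (y', x')'$ or $x$ and $f_B = n_B/N$. For $\mathrm{Cov}(\bar y_A, \bar x_B)$, 
let $E_A$ and $E_{B|A}$ be the expectation for selecting an 
SRSWOR $A$ from $P$ and choosing an SRSWOR $B$ from 
$P - A$ given $A$, respectively. Note that $E_{B|A}(\bar x_B) = (\bar X - 
f_A \bar x_A)/(1 - f_A)$. So, we have 

$$\mathrm{Cov}(\bar y_A, \bar x_B) = E(\bar y_A \bar x_B') - E(\bar y_A)E(\bar x_B)' = E_A(\bar y_A E_{B|A}(\bar x_B)') - \bar Y \bar X' = -S_{yx}/N,$$

where $S_{yx} = (N - 1)^{-1}\sum_{i \in P}(y_i - \bar Y)(x_i - \bar X)'$. Similarly, 
$\mathrm{Cov}(\bar x_A, \bar x_B) = -S_x^2/N$. 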
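Now consider a Taylor expansion of $\hat\theta^* = h(\bar y_A^*, \bar x_A^*, \tilde{\bar x}_B^*)$ 
with $\tilde{\bar x}_B^* = \bar x_B^* + f_A(\bar x_A - \bar x_A^*)/(1 - f_A)$, the 
bootstrap analogue of $\hat\theta = h(\bar y_A, \bar x_A, \bar x_B)$. Let $E_*$ and $V_*$ be 
the expectation and variance under the proposed bootstrap 
procedure, respectively. First, observe that $E_*(\bar y_A^*) = \bar y_A$, 
$E_*(\bar x_A^*) = \bar x_A$ and 

$$E_*(\tilde{\bar x}_B^*) = E_{A^*} E_{B^*|A^*}(\tilde{\bar x}_B^*) = E_{A^*}\{\bar x_B + f_A(\bar x_A - \bar x_A^*)/(1 - f_A)\} = \bar x_B,$$

where $E_{A^*}$ and $E_{B^*|A^*}$ are respectively the expectation 
with respect to sampling $A^*$ and the conditional expectation 
with respect to sampling $B^*$ given $A^*$ under the proposed 
bootstrap method. Then, $\hat\theta^* = h(\bar y_A^*, \bar x_A^*, \tilde{\bar x}_B^*)$ is approxi- 
mated by 

$$\hat\theta^* = \hat\theta + \nabla h^{*\prime}(\Delta\bar y_A^{*\prime}, \Delta\bar x_A^{*\prime}, \Delta\tilde{\bar x}_B^{*\prime})' + o_p(n_A^{-1/2}),$$

where $\nabla h^*$ is the gradient of $h$ evaluated at $(\bar y_A, \bar x_A, \bar x_B)$, 
$\Delta\bar y_A^* = \bar y_A^* - \bar y_A$, $\Delta\bar x_A^* = \bar x_A^* - \bar x_A$, and $\Delta\tilde{\bar x}_B^* = \tilde{\bar x}_B^* - \bar x_B$ 
(see equation 33.A.1 of Rao and Sitter 1997, page 20 467 and 
the required conditions therein). Therefore, $V_*(\hat\theta^*)$ is 
approximated by 

$$V_*(\hat\theta^*) \approx \nabla h^{*\prime}\, \Sigma^*_{(\bar y_A^*, \bar x_A^*, \tilde{\bar x}_B^*)}\, \nabla h^*,$$

where $\Sigma^*_{(\bar y_A^*, \bar x_A^*, \tilde{\bar x}_B^*)}$ is the variance-covariance matrix of 
$(\bar y_A^{*\prime}, \bar x_A^{*\prime}, \tilde{\bar x}_B^{*\prime})'$ under the proposed bootstrap sampling. 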

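Consistent variance estimation under the proposed 
method is proved by showing $\nabla h^*$ and $\Sigma^*_{(\bar y_A^*, \bar x_A^*, \tilde{\bar x}_B^*)}$ are 
consistent for $\nabla h$ and $\Sigma_{(\bar y_A, \bar x_A, \bar x_B)}$, respectively. 
Consistency of $\nabla h^*$ for $\nabla h$ follows from consistency of 
$(\bar y_A, \bar x_A, \bar x_B)$ for $(\bar Y, \bar X, \bar X)$ and continuity of $h$. 
Consistency of $\Sigma^*_{(\bar y_A^*, \bar x_A^*, \tilde{\bar x}_B^*)}$ can be shown as follows. 
First, since we use a bootstrap method suitable for simple 
random sampling without replacement in selecting $A^*$, 
we have $\Sigma^*_{(\bar y_A^*, \bar x_A^*)} = (1 - f_A) s^2_{(y,x)A}/n_A$, where $s_{uA}^2 = 
(n_A - 1)^{-1}\sum_{i \in A}(u_i - \bar u_A)(u_i - \bar u_A)'$ with $u = (y', x')'$. Sec- 
ond, because 


1. $V_*(\tilde{\bar x}_B^*) = E_{A^*}(V_{B^*|A^*}(\tilde{\bar x}_B^*)) + V_{A^*}(E_{B^*|A^*}(\tilde{\bar x}_B^*))$, where 
$V_{A^*}$ and $V_{B^*|A^*}$ are respectively the variance with 
respect to sampling $A^*$ and the conditional variance 
with respect to sampling $B^*$ given $A^*$, 

2. $V_{B^*|A^*}(\tilde{\bar x}_B^*) = (1 - f_{B|A}) s_{xB}^2/n_B$, where $s_{xB}^2 = (n_B - 
1)^{-1}\sum_{i \in B}(x_i - \bar x_B)(x_i - \bar x_B)'$ and $f_{B|A} = n_B/(N - 
n_A)$, and 

3. $V_{A^*}(E_{B^*|A^*}(\tilde{\bar x}_B^*)) = f_A^2 V_{A^*}(\bar x_A^*)/(1 - f_A)^2$, 

we have $\Sigma^*_{\tilde{\bar x}_B^*} = (1 - f_{B|A}) s_{xB}^2/n_B + f_A s_{xA}^2/(N - n_A)$. Since 
both $s_{xA}^2$ and $s_{xB}^2$ are consistent for $S_x^2$, $\Sigma^*_{\tilde{\bar x}_B^*}$ is 
consistent for $\Sigma_{\bar x_B} = (1 - f_B) S_x^2/n_B$. Finally, we 
compute $\mathrm{Cov}_*(\bar y_A^*, \tilde{\bar x}_B^*)$ and $\mathrm{Cov}_*(\bar x_A^*, \tilde{\bar x}_B^*)$. For 
the former, we have 

$$\mathrm{Cov}_*(\bar y_A^*, \tilde{\bar x}_B^*) = E_*(\bar y_A^* \tilde{\bar x}_B^{*\prime}) - E_*(\bar y_A^*) E_*(\tilde{\bar x}_B^*)'$$
$$= E_{A^*}(\bar y_A^* E_{B^*|A^*}(\tilde{\bar x}_B^*)') - \bar y_A \bar x_B'$$
$$= E_{A^*}(\bar y_A^* \{\bar x_B + f_A(\bar x_A - \bar x_A^*)/(1 - f_A)\}') - \bar y_A \bar x_B'$$
$$= -s_{yxA}/N,$$

where $s_{yxA} = (n_A - 1)^{-1}\sum_{i \in A}(y_i - \bar y_A)(x_i - \bar x_A)'$. Similarly, 
$\mathrm{Cov}_*(\bar x_A^*, \tilde{\bar x}_B^*) = -s_{xA}^2/N$. This completes the proof of 
consistency of $\Sigma^*_{(\bar y_A^*, \bar x_A^*, \tilde{\bar x}_B^*)}$ for $\Sigma_{(\bar y_A, \bar x_A, \bar x_B)}$. 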
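The step from the two-term bootstrap variance of the adjusted mean of $B^*$ to the target $(1 - f_B)S_x^2/n_B$ rests on an algebraic identity that the proof leaves implicit. As a worked check, replace both sample variances by their common limit $S_x^2$ and collect the coefficients:

```latex
\begin{align*}
\frac{1 - f_{B|A}}{n_B} + \frac{f_A}{N - n_A}
  &= \frac{N - n_A - n_B}{n_B (N - n_A)} + \frac{n_A}{N (N - n_A)} \\
  &= \frac{N(N - n_A - n_B) + n_A n_B}{N n_B (N - n_A)}
   = \frac{(N - n_A)(N - n_B)}{N n_B (N - n_A)} \\
  &= \frac{N - n_B}{N n_B}
   = \frac{1 - f_B}{n_B}.
\end{align*}
```

The first term is the conditional variance from resampling $B$ and the second is the contribution of the mean adjustment; together they reproduce the variance of a marginal SRSWOR of size $n_B$ from $P$.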


Appendix B 


In this appendix, we derive v,, (y,,). Under the mean- 
adjusted bootstrap, 


Vr=Va 
Cxiiazex a) a ore 
+(1—w,)b,. tes =Xalt xe Xa) 
asa Ue Dy 
Define 
a ¥ AR x; Yi» 
é i [Sie on ae Se X 5] 

and 


aly, Vp 2 aan x Vali eee oxi =vBe(e pi 


Note that b,. = i 7. Srpea ites ia Bi)! Let vp = h(&). 
This expression is slightly different from that in Appendix 
A, but we may exploit independent subsampling of A” and 
B*. Then, by Taylor linearization of 7; = h(&°) around &, 
we obtain py, =y,+Vh"(&—&) and V.(p,)=Vh" YI. 

Vh", where é 

Vh* = Pp aw af) ees Oy golly glee ganas 
Z4—-246,,5,(1- w,)] 

and }:-= [v,] with 


a2 
Vin Cas ee 
V3, C4 Sea? 

a2 
V5) aly SD 


uy Slop pals Dy, P= Seba 
59 = C4 (14 DD 1 - EO -F a> 
M53 = C4 (M4 - I D0 i -Su)> 

V4 = C4 (4 “WD OF — Ean Fads 
Vy miey (tla Dage M6, ai oo Me ae 


Vy3 = C4 (M4 - De De, (x; — S20 i ¥i — S11) 


) 


a 2 2 
Vag = €, (ng —1) Dea (x; —§20)'s 
V5 = Vs7 =Vs3 = Vsq = 9, 
=i ater 
M55 age (N —1,)7-} Sop, 


Vv, =Vj, and c,=(1—f,)/n, Rewriting the moments 
from the origin as the central moments, noting that 
¥, —¥,=5,(%; —¥ 4) +e, and using properties of e, as the 
least-squares residuals, we obtain the right hand side of (5) 
after some algebra. 


References 


Berger, Y.G., and Rao, J.N.K. (2006). Adjusted jackknife for 
imputation under probability sampling without replacement. 
Journal of the Royal Statistical Society, B, 68, 531-547. 


Biemer, P.P., and Atkinson, D. (1993). Estimation of measurement 
bias using a model prediction approach. Survey Methodology, 19, 
127-136. 


Chambers, R.L., and Dunstan, R. (1986). Estimating distribution 
functions from survey data. Biometrika, 73, 597-604. 


Chen, J., Sitter, R.R. and Wu, C. (2002). Using empirical likelihood 
method to obtain range restricted weights in regression estimator 
for surveys. Biometrika, 89, 230-237. 


Chen, J., and Wu, C. (2002). Estimation of distribution function and 
quantiles using the model-calibrated pseudo empirical likelihood 
method. Statistica Sinica, 12, 1223-1239. 


Cochran, W.G. (1977). Sampling Techniques. 3rd Edition. New York: 
John Wiley & Sons, Inc. 


Demnati, A., and Rao, J.N.K. (2004). Linearization variance 
estimators for survey data. Survey Methodology, 30, 17-26. 


Funaoka, F., Saigo, H., Sitter, R.R. and Toida, T. (2006). Bernoulli 
bootstrap for stratified multistage sampling. Survey Methodology, 
32, 151-156. 


Kim, J.-K., Navarro, A. and Fuller, W. A. (2006). Replication 
variance estimation for two-phase stratified sampling. Journal of 
the American Statistical Association, 101, 312-320. 


Lee, H., and Kim, J.-K. (2002). Jackknife variance estimation for 
two-phase samples with high sampling fractions. Proceedings of 
ASA Section on Survey Research Methods, 2024-2028. 


Rao, J.N.K., Kovar, J.G. and Mantel, H.J. (1990). On estimating 
distribution functions and quantiles from survey data. Biometrika, 
77, 365-375. 


Rao, J.N.K., and Shao, J. (1992). Jackknife variance estimation with 
survey data under hot deck imputation. Biometrika, 79, 811-822. 


Rao, J.N.K., and Sitter, R.R. (1995). Variance estimation under two- 
phase sampling with application to imputation for missing data. 
Biometrika, 82, 453-460. 


Rao, J.N.K., and Sitter, R.R. (1997). Variance estimation under 
stratified two-phase sampling with applications to measurement 
bias. In Survey Measurement and Process Quality: Wiley Series in 
Probability and Statistics. (Eds. L. Lyberg, P. Biemer, M. Collins, 
E. de Leeuw, C. Dippo, N. Schwarz and D. Trewin), New York. 
753-768. 


Royall, R.M., and Cumberland, W.G. (1981a). An empirical study of 
the ratio estimator and estimators of its variance. Journal of the 
American Statistical Association, 76, 66-77. 


Royall, R.M., and Cumberland, W.G. (1981b). The finite population 
linear regression estimator: An empirical study. Journal of the 
American Statistical Association, 76, 924-930. 


Schreuder, H.T., Li, H.G. and Scott, C.T. (1987). Jackknife and 
bootstrap estimation for sampling with partial replacement. Forest 
Science, 33, 676-689. 


Shao, J., and Tu, D. (1995). The Jackknife and Bootstrap. Springer- 
Verlag: New York. 


Sitter, R.R. (1997). Variance estimation for the regression estimator in 
two-phase sampling. Journal of the American Statistical 
Association, 92, 780-787. 


Woodruff, R.S. (1952). Confidence intervals for median and other 
position measures. Journal of the American Statistical 
Association, 47, 635-646. 


Wu, C., and Luan, Y. (2003). Optimal calibration estimators under 
two-phase sampling. Journal of Official Statistics, 19, 119-131. 


Survey Methodology, June 2007 
Vol. 33, No. 1, pp. 69-79 
Statistics Canada, Catalogue No. 12-001-XPB 




On standard errors of model-based small-area estimators 


Nicholas Tibor Longford¹ 


Abstract 


We derive an estimator of the mean squared error (MSE) of the empirical Bayes and composite estimator of the local-area 
mean in the standard small-area setting. The MSE estimator is a composition of the established estimator based on the 
conditional expectation of the random deviation associated with the area and a naive estimator of the design-based MSE. Its 
performance is assessed by simulations. Variants of this MSE estimator are explored and some extensions outlined. 


Key Words: Composite estimation; Empirical Bayes estimation; Shrinkage; Small-area estimation. 


1. Introduction 


Design-based methods have over the years been proven 
to be inefficient for small-area estimation because, unlike 
empirical Bayes and related methods, they cannot make 
effective use of auxiliary information. However, the 
assumptions associated with the models that are applied 
remain a weakness of model-based methods because 
inferences based on them have the ubiquitous caveat of ‘If 
the model is valid ...’. In the application of empirical Bayes 
models to small-area estimation, the local areas (districts) 
are associated with random effects. In the design-based 
perspective, this assumption is not valid because in a 
hypothetical replication of the survey the same districts 
would be realised (except for some districts that happen not 
to be represented in the sample drawn), and the target 
quantities associated with them would also be the same. 
That is, the districts should be associated with fixed effects. 
The lack of validity in this aspect of empirical Bayes models 
has no adverse impact on estimation of small-area quantities 
(means, totals, proportions, and the like). Associating small 
areas with random effects is key to borrowing strength from 
or exploiting the similarity of the areas, as well as to doing 
so across variables, time points, surveys and other data 
sources, but it distorts the assessment of the precision of the 
estimators. Some composite estimators and estimators of 
their mean squared errors have the same deficiency. 

In the next section we diagnose this problem in detail, 
and in Section 3 propose a solution, which is then illustrated 
and assessed in Section 4 by simulations using a set of 
examples. These range from the simplest and most 
congenial (agreeing with most of the assumptions made) to 
more complex and realistic but least congenial, so as to 
explore the robustness of the method. Its fuller potential is 
discussed in the concluding section. 


2. Fixed and random 


By sampling variance of a general estimator $\hat\theta$ based on 
a given data-generating (sampling) process $\mathcal{X}$ we 
understand the variation of the values of $\hat\theta(X)$ in 
replications of the processes that generate datasets $X$ and 
apply $\hat\theta$ to them. In the design-based perspective, the 
replication of a survey of a country with its division to $D$ 
districts yields the same district-level population quantities 
$\theta_d$, $d = 1, \ldots, D$; these $D$ quantities are fixed. In contrast, 
each replication in the model-based perspective, using 
empirical Bayes models, starts by generating a fresh set of $D$ 
values $\theta_d$, independently of the previous replications. 

We regard the design-based perspective as appropriate, 
because, in principle, each quantity $\theta_d$ could be established 
with precision and a hypothetical replication of the survey 
would draw a sample from the same population, with the 
same division of the country into its districts and the same 
values of the recorded variables for each member of the 
population. Most established design-based methods are 
valid when the survey is based on a perfect sampling frame, 
which contains no duplicates and is exclusive for the studied 
population, and the sampling design is implemented with 
perfection, without any departures from the protocol. That 
is, the estimators they yield are (approximately) unbiased, 
the expressions for their sampling variances are correct, or 
nearly so, and these variances are estimated with small or no 
bias. 

In contrast, model-based methods carry a much heavier 
burden of assumptions that often cannot be verified. Various 
model diagnostic procedures are available, but they are all 
subject to uncertainty. Interpreting failure to find a contra- 
diction as evidence of absence of any contradiction is a 
commonly committed logical inconsistency. It can be 
overcome only by quoting properties of estimators when the 
assumptions are not valid, but such methods are difficult to 
develop because of a wide range of model violations that 
one would have to take into account. Yet, despite these 
drawbacks, model-based methods have proven their worth 
in small-area estimation and are nowadays rightly regarded 
as indispensable (Ghosh and Rao 1994; Rao 2003; and 
Longford 2005). 

The EURAREA project (EURAREA Consortium 2004) 
carried out a large-scale simulation study involving 
sampling from artificially generated populations that 
resemble the human populations of several European 
countries and application of several classes of estimators. It 
confirmed the superiority of model-based estimators, with 
several qualifications, but reported rather disappointing 
results regarding estimators of their standard errors. We 
trace this problem to an averaging applied in deriving the 
standard errors of shrinkage estimators. 

Suppose a population is divided into D districts, each of 
them of population size that can for all practical purposes be 
regarded as infinite, and independent simple random 
sampling schemes are applied in the districts. We assume 
that within each district $d$ the outcome variable $Y$ has the 
normal distribution with mean $\mu_d$ and the same variance 
$\sigma^2_W$, $N(\mu_d, \sigma^2_W)$. For the within-district population means 
$\mu_d$ we assume the superpopulation model $\mu_d \sim 
N(\mu^*, \sigma^{*2}_B)$, but we want to make inferences about a fixed 
set of (realised) means $\{\mu_d\}$. In Section 5, we discuss the 
more general regression setting defined by the within- 
district models 


$$(Y \mid d) \sim N(X_d \beta + \delta_d, \sigma^2_W),$$ 


in which $X_d$ are the within-district regression matrices, $\beta$ 
the set of corresponding regression parameters common to 
the districts, and $\delta_d$ is the deviation of the within-district 
regression from the typical regression defined by $\delta_d = 0$. In 
the superpopulation, $\delta_d$ are a random sample from 
$N(0, \sigma^{*2}_B)$, but we want to make inferences about the fixed 
(realised) set $\{\delta_d\}$. Thus, we use model-based estimators, 
but assess their properties by design-based criteria. 

Denote by $\mu$ the (national) mean of the quantities $\mu_d$ 
and by $\sigma^2_B$ the district-level variance, $\sigma^2_B = D^{-1} \sum_d 
(\mu_d - \mu)^2$. Note that they differ from their respective 
superpopulation counterparts $\mu^*$ and $\sigma^{*2}_B$. We assume first 
that $\sigma^2_B$, $\sigma^2_W$ and $\omega$ are known. Let $\hat\mu_d$ and $\hat\mu$ be the 
sample means of the variable of interest in district $d$ and in 
the whole domain (country). They are based on samples of 
respective sizes $n_d$ and $n = n_1 + \cdots + n_D$. When no 
covariates are used the empirical Bayes (shrinkage) 
estimator of $\mu_d$ is 


$$\tilde\mu_d = \frac{1}{1 + n_d\omega}\,\hat\mu + \frac{n_d\omega}{1 + n_d\omega}\,\hat\mu_d, \qquad (1)$$ 


where $\omega = \sigma^2_B/\sigma^2_W$ is the variance ratio. The model-based 
conditional variance of $\mu_d$, given the data, $\mu$, $\sigma^2_W$ and $\sigma^2_B$, 
equal to $\sigma^2_B/(1 + n_d\omega)$, is often regarded as the sampling 
variance of $\tilde\mu_d$; the origins of this practice can be traced to 
the application of the EM algorithm. A more careful 
derivation acknowledges that in the design-based 
perspective $\tilde\mu_d$ is biased for $\mu_d$, 

$$E(\tilde\mu_d \mid \mu_d) = \mu_d - \frac{\mu_d - \mu}{1 + n_d\omega},$$ 


and its mean squared error is 


$$\begin{aligned}
\mathrm{MSE}(\tilde\mu_d; \mu_d) &= \left(1 - \frac{1}{1 + n_d\omega}\right)^2 \frac{\sigma^2_W}{n_d} + \frac{(\mu_d - \mu)^2}{(1 + n_d\omega)^2} \\
&= \frac{n_d\omega^2\sigma^2_W}{(1 + n_d\omega)^2} + \frac{(\mu_d - \mu)^2}{(1 + n_d\omega)^2}, \qquad (2)
\end{aligned}$$ 


assuming, for simplicity, that $\hat\mu = \mu$. To emphasise that 
MSE depends on the target, we include both the estimator 
and the target in its argument. In particular, $\mathrm{MSE}(\hat\mu; \mu) \neq 
\mathrm{MSE}(\hat\mu; \mu_d)$, unless $\mu_d = \mu$. An inconvenient feature of 
the identity in (2) is that it involves $\mu_d$, the target of 
estimation. If we replace $(\mu_d - \mu)^2$ with its expectation 
over the districts, $\sigma^2_B$, we obtain the more familiar identity 


$$\overline{\mathrm{MSE}}(\tilde\mu_d; \mu_d) = \frac{\sigma^2_B}{1 + n_d\omega}, \qquad (3)$$ 


the EM-related conditional model-based variance of $\mu_d$. 
The bar over MSE indicates expectation (averaging) of 
$(\mu_d - \mu)^2$, the numerator in the last term of (2), over the 
districts, with the sample sizes $n_d$ intact. Throughout, we 
condition on the within-district sample sizes $n_d$, $d = 
1, \ldots, D$, even though in the sampling design each of them 
may be variable. $\overline{\mathrm{MSE}}$ can be interpreted as model 
expectation, although the expectation or average of the 
squared deviations $(\mu_d - \mu)^2$ could be considered and 
estimated for a given set of districts without any reference to 
a model. The conditional variance in (3) is appropriate for 
districts with $\mu_d$ in the ‘typical’ distance, $\sigma_B$, from the 
national mean $\mu$. When $|\mu_d - \mu| \neq \sigma_B$, an unbiased estimator 
of the conditional variance $\sigma^2_B/(1 + n_d\omega)$ is biased for 
$\mathrm{MSE}(\tilde\mu_d; \mu_d)$. As the bias is related to the population 
quantity $\mu_d - \mu$, it is not reduced by increasing the sample 
size. 
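
The contrast between the design-based MSE in (2) and its averaged version in (3) can be illustrated numerically. The following is a minimal sketch, not part of the original derivation; the function names are hypothetical, the national mean is treated as known (as assumed for (2)), and the parameter values are borrowed from the simulation setting of Section 4.

```python
import numpy as np


def shrinkage_coefficient(n_d, omega):
    """b_d = 1 / (1 + n_d * omega), the weight given to the national mean."""
    return 1.0 / (1.0 + np.asarray(n_d, dtype=float) * omega)


def design_mse(mu_d, mu, n_d, sigma2_w, sigma2_b):
    """Design-based MSE of the shrinkage estimator, as in (2).

    Assumes the national mean is known (mu_hat = mu) and n_d > 0.
    """
    b = shrinkage_coefficient(n_d, sigma2_b / sigma2_w)
    return (1.0 - b) ** 2 * sigma2_w / n_d + b ** 2 * (mu_d - mu) ** 2


def averaged_mse(n_d, sigma2_w, sigma2_b):
    """Averaged MSE, as in (3): sigma2_b / (1 + n_d * omega)."""
    return sigma2_b * shrinkage_coefficient(n_d, sigma2_b / sigma2_w)


# Illustrative values: mu = 20, sigma2_B = 8, sigma2_W = 100, n_d = 15.
mu, sigma2_b, sigma2_w, n_d = 20.0, 8.0, 100.0, 15
for mu_d in (mu, mu + np.sqrt(sigma2_b), 24.55):
    print(mu_d,
          design_mse(mu_d, mu, n_d, sigma2_w, sigma2_b),
          averaged_mse(n_d, sigma2_w, sigma2_b))
```

The two quantities coincide only when the district mean lies at distance $\sigma_B$ from $\mu$; for a district at the national mean the averaged value overstates the MSE, and for a distant district it understates it.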


3. Composite estimation of MSE 


To estimate $\mathrm{MSE}(\tilde\mu_d; \mu_d)$, we reuse the idea of 
shrinkage and combine the alternative estimators, $\sigma^2_B/(1 + 
n_d\omega)$ and a naive estimator of the MSE in (2). This 
composite estimator can be motivated as follows. If $n_d = 0$, 
and therefore $\tilde\mu_d = \hat\mu$, we have no direct information about 
$\mu_d$, so we cannot improve on $\sigma^2_B/(1 + n_d\omega)$ as an estimator 
of $\mathrm{MSE}(\tilde\mu_d; \mu_d)$. When $n_d$ is large, $\mu_d$ is estimated with 
precision sufficient for using $(\hat\mu_d - \hat\mu)^2$, possibly with an 
adjustment for bias, as an estimator of $(\mu_d - \mu)^2$. For 
intermediate sample sizes, we search for a composition 
(compromise) of these two alternatives that are suitable in 
the extreme settings, when $n_d = 0$ and as $n_d \to +\infty$. We 
therefore derive expressions for their MSEs and then for the 
MSE of their combination. 

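We regard the constant $\sigma^2_B/(1 + n_d\omega)$ as an estimator, 
and refer to it as the averaged estimator of MSE. Although 
it has no variance, it is biased, with mean squared error 


$$\mathrm{MSE}\left\{\frac{\sigma^2_B}{1 + n_d\omega};\, \mathrm{MSE}(\tilde\mu_d; \mu_d)\right\} = \frac{\{\sigma^2_B - (\mu_d - \mu)^2\}^2}{(1 + n_d\omega)^4}. \qquad (4)$$ 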

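The squared deviation $(\mu_d - \mu)^2$, involved in (2), is 
estimated naively by $(\hat\mu_d - \hat\mu)^2$ with bias equal to 
$\sigma^2_W(n_d^{-1} - n^{-1}) \approx \sigma^2_W/n_d$ and, assuming that $\hat\mu_d$ is normally 
distributed, 

$$\begin{aligned}
\mathrm{MSE}\{(\hat\mu_d - \hat\mu)^2; (\mu_d - \mu)^2\}
&= \mathrm{var}\{(\hat\mu_d - \hat\mu)^2 \mid \mu_d\} + [E\{(\hat\mu_d - \hat\mu)^2 - (\mu_d - \mu)^2 \mid \mu_d\}]^2 \\
&= \frac{2\sigma^4_W}{n_d^2} + \frac{4\sigma^2_W}{n_d}(\mu_d - \mu)^2 + \frac{\sigma^4_W}{n_d^2} \\
&= \frac{3\sigma^4_W}{n_d^2} + \frac{4\sigma^2_W}{n_d}(\mu_d - \mu)^2, \qquad (5)
\end{aligned}$$ 


derived from the properties of the non-central $\chi^2$ 
distribution and an approximation by letting $n \to +\infty$. As 
an alternative, $\tilde\mu_d$ may be used instead of $\hat\mu_d$; elementary 
operations yield the approximations 

$$E\{(\tilde\mu_d - \hat\mu)^2 \mid \mu_d\} = (1 - b_d)^2\left\{\frac{\sigma^2_W}{n_d} + (\mu_d - \mu)^2\right\},$$ 

$$\mathrm{var}\{(\tilde\mu_d - \hat\mu)^2 \mid \mu_d\} = \frac{(1 - b_d)^4}{n_d}\,\sigma^2_W\left\{\frac{2\sigma^2_W}{n_d} + 4(\mu_d - \mu)^2\right\},$$ 

where $b_d = 1/(1 + n_d\omega)$, and so 

$$\begin{aligned}
\mathrm{MSE}\{(\tilde\mu_d - \hat\mu)^2; (\mu_d - \mu)^2\}
&= \mathrm{var}\{(\tilde\mu_d - \hat\mu)^2 \mid \mu_d\} + [E\{(\tilde\mu_d - \hat\mu)^2 - (\mu_d - \mu)^2 \mid \mu_d\}]^2 \\
&= 3(1 - b_d)^4\frac{\sigma^4_W}{n_d^2} + 2(1 - b_d)^2(2 - 6b_d + 3b_d^2)\frac{\sigma^2_W}{n_d}(\mu_d - \mu)^2 \\
&\quad + b_d^2(2 - b_d)^2(\mu_d - \mu)^4. \qquad (6)
\end{aligned}$$ 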
This approximation is valid only for $b_d = 1/(1 + n_d\omega)$, so 
further approximation is involved when we substitute a 
possibly suboptimal choice or an estimate of $b_d$ based on an 
estimate of $\omega$. In general, the coefficient $b_d$ that minimises 
the MSE in (6) differs from $1/(1 + n_d\omega)$ because the 
shrinkage with $b_d = 1/(1 + n_d\omega)$ is optimal only for targets 
that are linear transformations of $\mu_d$ (Shen and Louis 
1998). We do not pursue this avenue because the solution, 
being a complicated function of the parameters, is likely to 
be sensitive to the error in estimation of the parameters. The 
estimator $(\hat\mu_d - \hat\mu)^2$ could be corrected for its bias in 
estimating $(\mu_d - \mu)^2$, although this may result in a negative 
estimate, especially when $n_d$ is small. 

Finally, we combine the two (biased) estimators of 
$\mathrm{MSE}(\tilde\mu_d; \mu_d)$, the averaged estimator $\sigma^2_B/(1 + n_d\omega)$ and 
the naive estimator derived from the identity in (2), using 
$(\hat\mu_d - \hat\mu)^2$ as an estimator of $(\mu_d - \mu)^2$. The MSEs of these 
two estimators depend on $(\mu_d - \mu)^2$, so we replace the 
relevant terms by their expectations across the districts $d$. 
We replace $(\mu_d - \mu)^2$ with $\sigma^2_B$, and $(\mu_d - \mu)^4$ with $3\sigma^4_B$ 
or, in general, with $\kappa\sigma^4_B$, where $\kappa$ is the kurtosis of the 
(district-level) distribution of $\mu_d$. Although it may at first 
appear that we have not gained anything, because we still 
have to remove the dependence of MSE on $(\mu_d - \mu)^2$ by 
using $\sigma^2_B$ instead, now we make this step at a later stage. In 
the simulations in Section 4, we show that this reduces the 
undesirable impact of averaging. 

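Thus, we search for the coefficient $c_d$ that minimises the 
expected MSE of the composite estimator of the MSE, 


$$\begin{aligned}
\widetilde{\mathrm{MSE}}(\tilde\mu_d; \mu_d)
&= (1 - c_d)\,\widehat{\mathrm{MSE}}(\tilde\mu_d; \mu_d) + c_d\,\overline{\mathrm{MSE}}(\tilde\mu_d; \mu_d) \\
&= (1 - c_d)\left\{(1 - b_d)^2\frac{\sigma^2_W}{n_d} + b_d^2(\hat\mu_d - \hat\mu)^2\right\} + c_d b_d\sigma^2_B. \qquad (7)
\end{aligned}$$ 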
d 


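To evaluate the MSE of this MSE estimator, as a 
function of $c_d$, we use the expressions 


$$\overline{\mathrm{MSE}}\{b_d\sigma^2_B; \mathrm{MSE}(\tilde\mu_d; \mu_d)\} = 2b_d^4\sigma^4_B,$$ 

$$\overline{\mathrm{MSE}}\{(\hat\mu_d - \hat\mu)^2; (\mu_d - \mu)^2\} = \frac{\sigma^4_W}{n_d^2}(3 + 4n_d\omega),$$ 

$$\overline{\mathrm{MSE}}\{(\tilde\mu_d - \hat\mu)^2; (\mu_d - \mu)^2\} = \frac{\sigma^4_W}{n_d^2}\{3(1 - b_d)^4 + 3b_d^2(2 - b_d)^2 n_d^2\omega^2 + 2(1 - b_d)^2(2 - 6b_d + 3b_d^2)n_d\omega\},$$ 

derived by averaging of the respective equations (4), (5) and 
(6); $(\mu_d - \mu)^2$ is replaced by $\sigma^2_B$ and $(\mu_d - \mu)^4$ by $3\sigma^4_B$. 
Assuming that the district-level targets $\mu_d$ are normally 
distributed, the MSE of the composite estimator in (7) is 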
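$$\begin{aligned}
&E\left[(1 - c_d)\left\{(1 - b_d)^2\frac{\sigma^2_W}{n_d} + b_d^2(\hat\mu_d - \hat\mu)^2\right\} + c_d b_d\sigma^2_B - (1 - b_d)^2\frac{\sigma^2_W}{n_d} - b_d^2(\mu_d - \mu)^2\right]^2 \\
&\quad= b_d^4\,E\{(1 - c_d)n_d\omega^2\sigma^2_W + (1 - c_d)(\hat\mu_d - \hat\mu)^2 + c_d\sigma^2_B(1 + n_d\omega) - n_d\omega^2\sigma^2_W - (\mu_d - \mu)^2\}^2 \\
&\quad= b_d^4\,E\{(1 - c_d)(\hat\mu_d - \hat\mu)^2 + c_d\sigma^2_B - (\mu_d - \mu)^2\}^2 \\
&\quad= b_d^4\left[(1 - c_d)^2\left\{\frac{2\sigma^4_W}{n_d^2} + \frac{4\sigma^2_W}{n_d}(\mu_d - \mu)^2\right\} + \left\{(1 - c_d)\frac{\sigma^2_W}{n_d} + c_d\{\sigma^2_B - (\mu_d - \mu)^2\}\right\}^2\right],
\end{aligned}$$ 


using the identities $(1 - b_d)^2 = b_d^2 n_d^2\omega^2$ and $\sigma^2_B = \sigma^2_W\omega$ to 
extract the factor $b_d^4$. By taking the expectation over the 
districts, keeping the sample sizes intact, we obtain 


$$\overline{\mathrm{MSE}}\{\widetilde{\mathrm{MSE}}(\tilde\mu_d; \mu_d)\} = \frac{b_d^4}{n_d^2}\{(1 - c_d)^2(3 + 4n_d\omega)\sigma^4_W + 2c_d^2 n_d^2\sigma^4_B\}.$$ 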


The minimum of this quadratic function of $c_d$ is attained for 


$$c^*_d = \frac{3 + 4n_d\omega}{3 + 4n_d\omega + 2n_d^2\omega^2}.$$ 


This choice of a coefficient $c_d$ agrees with our expectations. 
For $n_d = 0$, $c^*_d = 1$ and we rely solely on the averaged MSE 
estimator, equal to $\sigma^2_B$. Further, $c^*_d$ is a decreasing function 
of $n_d$, converging to zero as $n_d$ diverges to $+\infty$; for large 
$n_d$ we rely on the naive estimator of MSE. It is also a 
decreasing function of $\omega$; for $\omega = 0$, that is, $\sigma^2_B = 0$, $c^*_d = 1$ 
for every district $d$, confirming that $\mu_d = \mu$ and $\mu_d$ would 
be estimated precisely if $\mu$ were known. With increasing 
$\omega$, $\sigma^2_B/(1 + n_d\omega)$ becomes less and less useful because the 
squared deviations $(\mu_d - \mu)^2$ are widely spread (around 
$\sigma^2_B$). 
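
The behaviour of the coefficient and of the composite estimator in (7) can be sketched in a few lines. This is an illustration only, assuming that the variances are known and that the naive component is the expression in (2) with $(\hat\mu_d - \hat\mu)^2$ substituted for $(\mu_d - \mu)^2$; the function names and the input values 24.0 and 20.0 are hypothetical.

```python
import numpy as np


def c_star(n_d, omega):
    """Weight on the averaged estimator, as a function of n_d * omega."""
    x = np.asarray(n_d, dtype=float) * omega
    return (3.0 + 4.0 * x) / (3.0 + 4.0 * x + 2.0 * x ** 2)


def composite_mse(mu_hat_d, mu_hat, n_d, sigma2_w, sigma2_b):
    """Composite MSE estimate as in (7); variances treated as known, n_d > 0."""
    omega = sigma2_b / sigma2_w
    b = 1.0 / (1.0 + n_d * omega)
    # Naive estimate: (2) with the squared deviation replaced by its estimate.
    naive = (1.0 - b) ** 2 * sigma2_w / n_d + b ** 2 * (mu_hat_d - mu_hat) ** 2
    # Averaged estimate: (3), which does not depend on the data.
    averaged = b * sigma2_b
    c = c_star(n_d, omega)
    return (1.0 - c) * naive + c * averaged


omega = 8.0 / 100.0
for n in (0, 15, 100, 1000):
    print(n, float(c_star(n, omega)))  # weight declines from 1 towards 0

# Hypothetical direct estimate 24.0 and national mean estimate 20.0.
print(composite_mse(mu_hat_d=24.0, mu_hat=20.0, n_d=15,
                    sigma2_w=100.0, sigma2_b=8.0))
```

By construction the composite value lies between the naive and the averaged estimates, closer to the averaged one for small $n_d\omega$.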

If we adjust $(\hat\mu_d - \hat\mu)^2$ for its bias in estimating 
$(\mu_d - \mu)^2$, the expected MSE of the shrinkage estimator is 
minimised for 

$$c^\dagger_d = \frac{1 + 2n_d\omega}{(1 + n_d\omega)^2}.$$ 


It is easy to check that 

$$c^*_d - c^\dagger_d = \frac{n_d^2\omega^2}{(1 + n_d\omega)^2(3 + 4n_d\omega + 2n_d^2\omega^2)},$$ 


so the bias-adjusted estimator derived from (2) is assigned 
greater weight (equal to $1 - c^\dagger_d$) than the naive estimator 
would be. But the difference is small for all values of $n_d$. 
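
The identity for the difference of the two coefficients, and the claim that the difference is small, can be checked numerically. The sketch below writes both coefficients as functions of $x = n_d\omega$; the closed form used for the bias-adjusted coefficient, $(1 + 2x)/(1 + x)^2$, is the one obtained by repeating the minimisation with the bias term removed, and the grid of values is an arbitrary choice.

```python
import numpy as np


def c_star(x):
    """Coefficient for the unadjusted naive estimator, x = n_d * omega."""
    return (3.0 + 4.0 * x) / (3.0 + 4.0 * x + 2.0 * x ** 2)


def c_dagger(x):
    """Coefficient for the bias-adjusted estimator, x = n_d * omega."""
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


x = np.linspace(0.0, 50.0, 501)
diff = c_star(x) - c_dagger(x)
closed_form = x ** 2 / ((1.0 + x) ** 2 * (3.0 + 4.0 * x + 2.0 * x ** 2))

print(np.allclose(diff, closed_form))  # the identity holds on the grid
print(diff.max())                      # largest difference on the grid
```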
The composite MSE estimator based on $(\tilde\mu_d - \hat\mu)^2$ is 
derived similarly, but the resulting expression is much more 
complex. The optimal shrinkage coefficient is 


$$\begin{aligned}
c^\ddagger_d = {} & \{3(1 - b_d)^4 + 2(1 - b_d)^2 f(b_d)\,n_d\omega - b_d(2 - b_d)f(b_d)\,n_d^2\omega^2\} \\
& \times [3(1 - b_d)^4 + 2(1 - b_d)^2 f(b_d)\,n_d\omega + \{2 - 4b_d(2 - b_d) + 3b_d^2(2 - b_d)^2\}n_d^2\omega^2]^{-1},
\end{aligned}$$ 


where $f(b_d) = 2 - 6b_d + 3b_d^2$. The dependence on $b_d$ is 
particularly problematic, because in practice $b_d$ is estimated 
and the properties of the MSE estimator based on estimated 
$c^\ddagger_d$ are bound to be affected by the uncertainty about $b_d$. In 
the derivations, we used the identity $b_d = 1/(1 + n_d\omega)$, so 
this expression could not be used when the values of $b_d$ are 
set a priori. 


4. Simulations 


Properties of the composite estimator of MSE cannot be 
derived analytically, and so we resort to simulations. We 
consider the artificial setting of a national survey with a 
stratified sampling design, with strata coinciding with the 
country’s 100 districts for which estimates of the means of a 
variable Y are sought. Simple random sampling is applied 
within each stratum, assumed to be of practically infinite 
population size. We have generated the values of the means 
$\mu_d$ from the normal distribution $N(\mu = 20, \sigma^2_B = 8)$, and 
the sample sizes $n_d$ from scaled conditional beta distributions, 
given the means $\mu_d$, so as to inject a modicum of 
dependence of the means on the sample sizes. With this 
adjustment, the assumption underlying the averaged MSE 
estimator is false, but this could not be detected by a 
diagnostic procedure or a hypothesis test, not even with $\mu_d$ 
known. The sample size of one district was altered to be 
much greater than the rest, to represent the capital of the 
fictitious country. The within-stratum distributions of $Y$ are 
$N(\mu_d, \sigma^2_W = 100)$. The district-level means and sample 
sizes are fixed in the replications. For orientation, they are 
plotted in Figure 1. The districts are assigned order numbers 
from 1 to 100 in the ascending order of their sample sizes. 
The smallest sample size is $n_1 = 15$ and the overall sample 
size is 3,698. 

In the simulations, comprising 1,000 replications, we 
generate the direct estimates $\hat\mu_d$ as independent random 
draws from $N(\mu_d, \sigma^2_W/n_d)$ and the within-district corrected 
sums of squares as independent draws from the 
appropriately scaled $\chi^2$ distributions with $n_d - 1$ degrees of 
freedom. Then we evaluate the shrinkage estimator $\tilde\mu_d$ for 
each district $d$, followed by evaluation of the averaged, 
naive and the two composite MSE estimators using the 
coefficients $c^*_d$ and $c^\dagger_d$, or their naive estimates. 
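
The replication scheme can be sketched as follows for the setting with known global parameters. This is a simplified illustration and not the code used for the reported results: the sample sizes are drawn from a uniform integer range as a stand-in for the scaled conditional beta distributions (whose parameters are not given here), no capital district is added, the seed is arbitrary, and the corrected sums of squares are not generated because the variances are treated as known.

```python
import numpy as np

rng = np.random.default_rng(2007)  # arbitrary seed

# Setting from the text: 100 districts, mu = 20, sigma2_B = 8, sigma2_W = 100.
D, mu, sigma2_b, sigma2_w = 100, 20.0, 8.0, 100.0
omega = sigma2_b / sigma2_w

# District means and sample sizes are generated once and then held fixed.
mu_d = rng.normal(mu, np.sqrt(sigma2_b), size=D)
# ASSUMPTION: uniform integers in [15, 80] replace the beta-based sample sizes.
n_d = np.sort(rng.integers(15, 81, size=D))

b = 1.0 / (1.0 + n_d * omega)
x = n_d * omega
c_star = (3.0 + 4.0 * x) / (3.0 + 4.0 * x + 2.0 * x ** 2)

R = 1000
# Direct estimates: one row per replication, one column per district.
mu_hat = rng.normal(mu_d, np.sqrt(sigma2_w / n_d), size=(R, D))
# Shrinkage estimates towards the known national mean.
mu_tilde = (1.0 - b) * mu_hat + b * mu

true_mse = (1.0 - b) ** 2 * sigma2_w / n_d + b ** 2 * (mu_d - mu) ** 2
empirical_mse = ((mu_tilde - mu_d) ** 2).mean(axis=0)

averaged = b * sigma2_b  # identical in every replication
naive = (1.0 - b) ** 2 * sigma2_w / n_d + b ** 2 * (mu_hat - mu) ** 2
composite = (1.0 - c_star) * naive + c_star * averaged


def root_mse(estimates):
    """Root-MSE of an MSE estimator over replications, per district."""
    return np.sqrt(((estimates - true_mse) ** 2).mean(axis=0))


rmse_averaged = np.abs(averaged - true_mse)  # constant estimator: |bias|
rmse_naive = root_mse(naive)
rmse_composite = root_mse(composite)
print(rmse_averaged.mean(), rmse_naive.mean(), rmse_composite.mean())
```

Plotting the three root-MSE vectors against `n_d` gives a display of the same kind as the right-hand panel described below, although the values differ because the sample sizes are not those of the original study.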
In the first set of replications, we assume that $\mu$, $\sigma^2_W$ and 
$\sigma^2_B$ are known, so that the simulation reproduces the 
theoretically derived results and enables us to assess the 
quality of the composite MSE estimators without the 
interference of uncertainty about the shrinkage coefficient 
$b_d = 1/(1 + n_d\omega)$. The results are summarised graphically in 
Figure 2. The empirical biases (their absolute values) of the 
four MSE estimators are plotted in the left-hand panel. 
Circles and black dots are used for the averaged and naive 
estimators, respectively, and the biases of the composite 
estimators are connected by solid lines. The absolute values 
of the empirical biases are plotted, to highlight their strong 
association with the sample size for the naive estimator. For 
60 districts (60%), the composite estimator of MSE has a 
positive bias. For the naive estimator, this count or 
percentage is higher (78), and for the averaged estimator 
lower (52). Throughout, the main contributor to the bias of 
the averaged MSE estimator is the deviation of the squared 
distance $(\mu_d - \mu)^2$ from the district-level variance $\sigma^2_B$. The 
two composite estimators, based on $(\hat\mu_d - \hat\mu)^2$ and on its 
bias-adjusted version, differ so little that their biases cannot 
be distinguished in the plot. The diagram shows that the 
averaged estimator of MSE entails substantial bias for a few 
districts, including several with large sample sizes. The 
biases of the naive and composite estimators are without 
such extremes. 


Figure 1 The district-level sample sizes and population means of Y. Artificially 
generated values 


Figure 2 The bias and root-MSE of estimators of the MSE of the empirical Bayes small-area estimators. 
Based on simulations with an artificial setting. The bias and root-MSE of the composite 
estimators are connected by solid lines 

In the right-hand panel, the root-MSEs of the MSE 
estimators are plotted, using the same symbols and layout. 
The diagram shows that the naive estimator is inefficient, 
especially for districts with the smallest sample sizes, 
whereas the averaged estimator is very efficient for some 
but inefficient for some other districts, without any apparent 
relation to their sample sizes. In fact, apart from sample size, 
high efficiency is associated with proximity of $(\mu_d - \mu)^2$ to 
$\sigma^2_B$, and low efficiency with the smallest and largest values 
of $(\mu_d - \mu)^2$. For example, the empirical root-MSE of the 
averaged MSE estimator for district 1, with $n_1 = 15$, is 2.63, 
whereas its counterpart for district 11 ($n_{11} = 16$) is 0.049. 
Their population means are $\mu_1 = 24.55$, exceeding $\mu + \sigma_B$ 
by 1.72, and $\mu_{11} = 22.87$, differing from $\mu + \sigma_B$ by only 
0.04. The root-MSEs of the naive estimator are 5.08 and 
3.51, and those of the composite estimator are 2.10 and 1.00 
for the respective districts 1 and 11. The composite MSE 
estimator performs much more evenly, moderating the 
deficiencies of the averaged and naive estimators. 

All three estimators are conservative (have positive 
biases) for districts with relatively small MSE of $\tilde\mu_d$. The 
averaged estimator has negative biases when the MSEs are 
relatively large. The composite estimator also has negative 
biases for such districts, but they tend to be smaller in 
absolute value. For districts with the smallest sample sizes, 
the composite estimator is not very effective because the 
naive estimator is very inefficient. For a few of these 
districts, the composition is counterproductive, as a result of 
averaging, but such districts cannot be identified from a 
single realisation of the survey. 

Next we study a less congenial setting, in which the 
normality assumptions of $\mu_d$ across the districts and of the 
elementary observations $y_{id}$ within the districts are still 
satisfied, but the global parameters, $\mu$, $\sigma^2_W$ and $\sigma^2_B$, are not 
known and are estimated. We use the same means $\mu_d$ and 
sample sizes $n_d$ as in Figure 1. The results of the simulations 
are summarised in Figure 3. In the left-hand panel, 
the empirical means of the MSE estimators are plotted, 
using the same symbols as in Figure 2, together with the 
empirical MSEs (crosses ‘+’) of the shrinkage estimators 
$\tilde\mu_d$. The empirical means of the averaged estimators have a 
regular pattern because the estimates in each replication 
depend only on the sample size $n_d$ and the estimated 
variance ratio $\hat\omega$. For biases, the naive estimators have a 
regular pattern, similar to their pattern in Figure 2. The 
naive estimators have positive biases that decline with the 
sample size. The averaged estimators are far too conservative; 
their means do not veer from the smooth trend. The 
composite MSE estimators deviate from this trend in the 
appropriate direction, but not to full degree. Their average 
bias is positive, equal to 0.22, or 10% (2.42 vs. 2.20), and 
they overestimate the target MSE for 70 out of the 100 
districts. 

The right-hand panel displays the root-MSEs of the MSE 
estimators. The naive estimator is inefficient, whereas the 
averaged estimator is very efficient for some and rather 
inefficient for other districts. The composite MSE estimator 
is more efficient than either naive or averaged estimator for 
36 districts; it is more efficient than the averaged estimator 
in exactly half of the districts, but it does not have its glaring 
weaknesses. As in the congenial setting (Figure 2), the 
differences due to bias adjustment of $(\hat\mu_d - \hat\mu)^2$ in composite 
MSE estimation (using coefficients $c^*_d$ or $c^\dagger_d$) are 
negligible. 

Figure 3 The mean and root-MSE of estimators of the MSE of the empirical Bayes small-area 
estimators. The global parameters $\mu$, $\sigma^2_W$ and $\sigma^2_B$ are estimated 
Next we compare the MSE estimators for the district-level 
means of $Y^2/100$, denoted by $\nu_d$. The assumptions of 
normality both within and across districts are no longer 
appropriate. We apply the methods that rely on the 
normality assumptions, to assess the robustness of the 
composite estimators, but also to contrast the deficiencies of 
the averaging with the consequences of using ‘incorrect’ 
models. We chose the square transformation because the 
within-district expectations are known, equal to 
$(\mu_d^2 + \sigma^2_W)/100$, and could be estimated by 

$$\tilde\nu_d = \frac{\tilde\mu_d^2 + \hat\sigma^2_W}{100}. \qquad (8)$$ 

We denote by $\hat\nu_d$ the empirical Bayes estimators applied to 
$y_{id}^2/100$. 

The results of the simulations based on the values of 
$y_{id}^2/100$ are presented in Figure 4, using the same layout 
and symbols as in Figure 3. The same conclusions about the 
biases and root-MSEs are arrived at as before, except that 
the naive estimator is even more inefficient and the performance 
of the averaged estimator even more erratic - it is 
both very efficient and inefficient for more districts than in 
the more congenial setting of Figure 3. The naive estimator 
is conservative, but for some districts with small $n_d$ far too 
much so, and its MSEs for these districts are very large. 

We contrast these conclusions with a comparison of 
estimating the district-level means of $Y^2/100$ by $\tilde\nu_d$, 
transforming the estimates $\tilde\mu_d$ according to (8). The 
estimator $\tilde\nu_d$ is more efficient than $\hat\nu_d$ for most districts 
(90, in fact), and when less efficient, the relative difference 
of their MSEs is less than 4%. For a few districts, the 
difference in efficiency is perceptible, exceeding 20% for 
ten districts. However, the differences in the MSEs are small 
in comparison with the biases in estimating these MSEs, as 
shown in Figure 5. The biases and MSEs of , are marked 
by black dots connected to their counterparts for 1. 

Part of the lack of efficiency of $\hat\nu_d$ is due to its bias; the 
bias of $\hat\nu_d$ exceeds the bias of $\tilde\nu_d$ for all but two districts, 
but the difference is non-trivial only when both estimators 
are positively biased. Thus, little efficiency is gained by 
arranging the analysis so that the distributional assumptions 
are satisfied. The gains are modest in comparison with the 
increase in the difficulty of estimating the efficiency, as 
expressed by $\mathrm{MSE}(\tilde\nu_d; \nu_d)$. Although the sampling 
variation of $\hat\sigma^2_W$ is trivial in large-scale surveys, the 
contribution of $\mathrm{MSE}(\tilde\mu_d; \mu_d)$ to $\mathrm{MSE}(\tilde\nu_d; \nu_d)$ cannot be 
ignored. 

Figure 6 compares the composite MSE estimator with the 
naive estimator of MSE of $\tilde\mu_d$ based on the empirical Bayes 
estimator of $\mu_d$; it is derived by substituting $\tilde\mu_d$ for $\mu_d$ in 
(2). For brevity, we refer to it as the EB-naive estimator. As 
anticipated in Section 3, it tends to underestimate its target. 
It is more efficient than the composite estimator of MSE for 
about half the districts (52 out of 100), but its performance 
is more uneven than that of the composite MSE estimator. 
In principle, the EB-naive estimator could be improved by 
combining it with the averaged estimator; however, only 
minor improvement is made even in the congenial setting 
(known $\mu$, $\sigma^2_W$ and $\sigma^2_B$), and the composition is 
detrimental for several districts in the less congenial 
settings. Details are omitted. 
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Figure 4 The mean and root-MSE of estimators of the MSE of the empirical Bayes small-area estimators; 


estimation of the means of Y” /100 
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Figure 5 The biases and MSEs of estimators of v,. The vertical segments connect the quantities associated 
with ¥, and V,. The quantities associated with ¥, are marked by black dots 
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Figure 6 The bias and root-MSE of the composite and empirical-Bayes naive estimators of the MSE of ji, 


As a final simulation, we consider a binary outcome 
variable that indicates whether Y <5, so that the district- 
level percentages are in the range 1.5-18.8 and the 
dependence of the percentage on the variance within 
districts is substantial. The mean of the district-level 
percentages is 6.85; the substantial skew of these 
percentages (skewness coefficient equal to 1.01 and kurtosis 
to 3.78) provides a stern test of the method. 

In the simulation, the district-level percentages are 
estimated by the univariate version of the shrinkage method 
described in Longford (1999 and 2005, Chapter 8). The 
results are summarised in Figure 7. The MSE is overestimated 
by all three estimators for most districts, except 
for a minority for which the empirical MSE is several times 
higher than for the rest. The naive estimator has a substantial 
bias for most districts. The averaged estimator is less 
regimented than for normally distributed outcomes because 
the shrinkage coefficient depends also on the estimated 
proportion, which is truncated from below at 2% to avoid 
zero estimated variance $\hat p_d(1-\hat p_d)/n_d$. The graph of the 
composite MSE estimates has the spikes for the appropriate 
districts, but the spikes are far too short to reduce the bias 
substantially. 

The MSEs of the averaged estimator are satisfactory for 
most, but are very large for several districts. For the latter 
districts, the naive MSE estimator is even less efficient. The 
composite MSE estimator is less efficient than the averaged 
estimator for many districts, but the difference is rather 
small, compensated by the gains in efficiency for districts 
for which the averaged estimator is less efficient. The EB- 
naive MSE estimator resembles in many features the naive 
MSE estimator; it is not plotted in the diagram. 

In conclusion, this simulation shows that when one of the 
MSE estimators, in this case the naive estimator, is very 
inefficient, it nevertheless contributes, even if very 
modestly, to the efficiency of the composite MSE estimator. 
The composite estimator draws on the best that the 
constituent estimators, averaged and naive, have to offer, 
even in uncongenial settings. A remaining challenge is to 
combine the naive and averaged estimators to satisfy a 
particular criterion which trades off the precision for 
districts that are estimated with high precision for higher 
precision in estimating in the districts with low precision. 
For example, we may be less concerned about estimation of 
the MSEs for districts with abundant representation in the 
sample and much more about the sparsely represented 
districts. Also, some districts (e.g., those in a particular 
region) may be of specific interest, unrelated to their 
representation. Of course, the first step in this is the 
definition of one or a class of criteria that reflect the 
inferential priorities, and this is bound to be specific to each 
survey and client. See Longford (2006) for some proposals. 


4.1 Refinements and extensions 


Several elements of realism can be incorporated in the 
derivation of the composite MSE estimator. First, 
uncertainty about $\mu$ can be reflected by acknowledging that 
$\hat\mu_d$ and $\hat\mu$ are correlated. Thus, 
$\mathrm{var}(\hat\mu_d - \hat\mu) = \sigma^2_W(1/n_d - 1/n)$ 
and the approximation in (5) becomes equality when 
both instances of $\sigma^2_W/n_d$ are replaced by $\sigma^2_W(1/n_d - 1/n)$. 
This brings about only a slight change when $n_d \ll n$, the 
case for most districts. If the country has a dominant district, 
with sample size that is a large fraction of the overall sample 
size, then this adjustment might be relevant, but it has a 
negligible impact on MSE estimation because even direct 
estimation of the mean for the district is nearly efficient. 

A similar refinement can be applied to the empirical 
Bayes estimator of $\mu_d$. It amounts to replacing $n_d$ with 
$1/(n_d^{-1} - n^{-1}) = n_d n/(n - n_d)$ in the coefficient 
$b_d = 1/(1 + n_d\omega)$. The change is not trivial only for a dominant 
district, but for such a district shrinkage yields only minute 
improvement over direct estimation with or without this 
adjustment. 
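
The size of this adjustment can be illustrated numerically. The following is a minimal Python sketch, not part of the original analysis; the overall sample size, the district sample sizes and the value of the variance ratio are hypothetical, and the function names are introduced here for illustration only.

```python
def effective_sample_size(n_d: float, n: float) -> float:
    """Return 1 / (1/n_d - 1/n) = n_d * n / (n - n_d)."""
    if not 0 < n_d < n:
        raise ValueError("require 0 < n_d < n")
    return n_d * n / (n - n_d)


def shrinkage_coefficient(n_d: float, omega: float) -> float:
    """Coefficient b_d = 1 / (1 + n_d * omega)."""
    return 1.0 / (1.0 + n_d * omega)


# Hypothetical values: overall sample size and variance ratio.
n, omega = 10_000, 0.05

# The last district is 'dominant' (60% of the overall sample).
for n_d in (20, 200, 6_000):
    b_plain = shrinkage_coefficient(n_d, omega)
    b_adjusted = shrinkage_coefficient(effective_sample_size(n_d, n), omega)
    print(f"n_d={n_d:5d}  b_d={b_plain:.5f}  adjusted b_d={b_adjusted:.5f}")
```

For the small districts the two coefficients are nearly identical; for the dominant district both are close to zero, so shrinkage changes little either way.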


Figure 7 The mean and root-MSE of the composite, naive and averaged estimators of the MSEs of district-level percentages (vertical axis: root-MSEs of MSE estimators, %; horizontal axis: districts in ascending order of sample size) 

Accommodating sampling designs that differ from 
stratified random sampling, and which associate subjects 
with sampling weights, generates in composite estimation 
no problems additional to direct estimation with such 
designs and weights, because we require only the sampling 
variances of $\hat\mu_d$, $\hat\mu$ and functions of these. Similarly, 
exploiting auxiliary information by applying (empirical 
Bayes) regression 

$$y_{jd} = \mathbf{x}_{jd}\boldsymbol\beta + \delta_d + \varepsilon_{jd},$$

with independent random samples $\delta_d \sim N(0, \sigma^2_B)$ and 
$\varepsilon_{jd} \sim N(0, \sigma^2_W)$, amounts to replacing $\hat\mu$ in (1) with the 
prediction $\bar{\mathbf{x}}_d\hat{\boldsymbol\beta}$, where $\bar{\mathbf{x}}_d$ is the vector of means of the 
regressors for district $d$ and $\hat{\boldsymbol\beta}$ is the vector of regression 
parameter estimates. To see this, we express the empirical 
Bayes fit for district $d$ as 

$$\bar{\mathbf{x}}_d\hat{\boldsymbol\beta} + \frac{n_d\omega}{1+n_d\omega}\,(\hat\mu_d - \bar{\mathbf{x}}_d\hat{\boldsymbol\beta}) = \frac{1}{1+n_d\omega}\,\bar{\mathbf{x}}_d\hat{\boldsymbol\beta} + \frac{n_d\omega}{1+n_d\omega}\,\hat\mu_d .$$

Pfeffermann et al. (1998) discuss issues related to fitting 
empirical Bayes models to observations with sampling 
weights. Composite estimation uses direct estimators $\hat{\boldsymbol\mu}_d$ 
and $\hat{\boldsymbol\mu}$ for the vectors of all the variables involved and their 
estimated sampling variance matrices; their evaluation is a 
standard task in sampling theory. An outstanding problem 
with empirical Bayes estimators arises when $\bar{\mathbf{x}}_d$ is based on 
very few observations because the uncertainty about $\mu_d$ is 
then inflated, even when the model fit is very good; if the 
vector of population means $\bar{\mathbf{X}}_d$ were known (available from sources 
external to the survey), $\mu_d$ would be estimated much more 
efficiently using $\bar{\mathbf{X}}_d\hat{\boldsymbol\beta}$. Composite estimation bypasses this 
problem by searching for the combination of district-level 
means of auxiliary variables, whether known or estimated 
from the survey or from other sources, aiming directly to 
minimise the MSE of the combination (Longford 1999). 

The approach developed in Section 3 can be adapted to 
distributions other than normal straightforwardly, so long as 
the kurtoses required for evaluating the district-level 
variance of $(\mu_d - \mu)^2$ and the sampling variance of 
$(\hat\mu_d - \mu)^2$ are known. In practice, kurtosis depends on the 
mean $\mu_d$, creating difficulties that can be overcome only by 
approximations or averaging. Estimating proportions $p_d$ 
with dichotomous data is a case in point. We have 


ry Vy 
var{(p,- p) }=—4(1-3p, +373) 


Ng 
4y 6v 5 Vv 
+—4(1-2p,)(p, - p) + —*(p, - py --S, 
Ng Ng Ng 


where $v_d = p_d(1 - p_d)/n_d$ and $p$ is the national proportion. 
The complex dependence on the poorly estimated $p_d$ 
presents an analytical challenge that does not have a 
universal solution. 
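
The dependence of this variance on $p_d$ can be explored numerically. The Python sketch below is an illustration under stated assumptions rather than a transcription of the paper's formula: it assumes that $n_d\hat p_d$ has a binomial distribution with parameters $n_d$ and $p_d$, and it treats $p$ as a fixed constant. The exact variance of $(\hat p_d - p)^2$ is obtained by enumerating the binomial outcomes, and it is compared with a closed form derived here from the standard central moments of the binomial distribution (the arrangement of terms may differ from the expression displayed above). The function names are hypothetical.

```python
from math import comb


def exact_var_sq_dev(n_d: int, p_d: float, p: float) -> float:
    """Variance of (p_hat - p)**2 when n_d * p_hat ~ Binomial(n_d, p_d)."""
    m1 = m2 = 0.0
    for k in range(n_d + 1):
        prob = comb(n_d, k) * p_d**k * (1 - p_d) ** (n_d - k)
        t = (k / n_d - p) ** 2
        m1 += prob * t
        m2 += prob * t * t
    return m2 - m1 * m1


def closed_form_var_sq_dev(n_d: int, p_d: float, p: float) -> float:
    """Same variance, written in terms of v = p_d (1 - p_d) / n_d."""
    v = p_d * (1 - p_d) / n_d
    delta = p_d - p
    return (
        v * (1 - 3 * p_d + 3 * p_d**2) / n_d**2  # fourth-moment part
        + 2 * v**2 - 3 * v**2 / n_d              # remainder of var(e^2)
        + 4 * v * (1 - 2 * p_d) * delta / n_d    # skewness (third moment)
        + 4 * v * delta**2                       # linear term in e
    )


# Hypothetical district: 25 subjects, true rate 4%, national rate 6.85%.
print(exact_var_sq_dev(25, 0.04, 0.0685))
print(closed_form_var_sq_dev(25, 0.04, 0.0685))
```

Evaluating either function over a grid of values of $p_d$ shows how strongly the variance reacts to small changes in a poorly estimated proportion.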

Throughout, we assumed that the value of the variance 
ratio $\omega$ is known. In practice, $\omega$ is estimated. It is difficult 
to take account of the uncertainty about $\omega$ analytically, but 
its impact on estimation of $\mu_d$ and $\mathrm{MSE}(\tilde\mu_d; \mu_d)$ can be 
assessed by sensitivity analysis which repeats the 
simulations described in Section 4 for a range of plausible 
values of $\omega$. As one set of simulations takes about one 
minute of CPU time, this is a manageable computational 
task. One difficulty in such an assessment is that with an 
altered assumed value of $\omega$ the estimator $\tilde\mu_d$ is changed, 
and so the target of the composite MSE estimator is also 
changed. An alternative informal approach considers the 
consequences of under- and over-stating the value of $\omega$. In 
estimating $\mu_d$ it is advisable to err on the side of greater $\omega$, 
giving more weight to the direct estimator $\hat\mu_d$ (Longford 
2005, Chapter 8). For estimating the MSE of $\tilde\mu_d$ we may 
prefer to err on the side of the more stable averaged 
estimator. That corresponds to increasing the value of the 
coefficient $c_d^*$ and, as $c_d^*$ is a decreasing function of $\omega$, to 
reducing the value of $\omega$ used for setting $c_d^*$. Of course, this 
should be done in moderation, so as not to discard the 
contribution of the naive estimator of MSE altogether. 
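
The effect of the assumed variance ratio on the point estimate can be sketched in a few lines of Python. This is an illustration only: it assumes that the shrinkage estimator has the form $(1 - b_d)\hat\mu_d + b_d\hat\mu$ with $b_d = 1/(1 + n_d\omega)$, and the numerical inputs and the function name are hypothetical.

```python
def eb_estimate(direct_d: float, national: float, n_d: int, omega: float) -> float:
    """Shrinkage estimate (1 - b_d) * direct_d + b_d * national."""
    b_d = 1.0 / (1.0 + n_d * omega)
    return (1.0 - b_d) * direct_d + b_d * national


# Hypothetical district: direct estimate 40, national estimate 30, n_d = 20.
for omega in (0.01, 0.05, 0.25):
    estimate = eb_estimate(40.0, 30.0, 20, omega)
    print(f"omega={omega:.2f}  estimate={estimate:.2f}")
```

Larger assumed values of the ratio move the estimate towards the direct estimator, which is the direction of error recommended in the text for estimating the district-level mean.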


5. Conclusion 


The approach developed in this paper applies the general 
idea of shrinkage to estimation of MSE of small-area 
estimators and reduces the impact of averaging, regarded as 
undesirable when viewed from the design-based perspective, 
in which the country’s districts have fixed 
population quantities $\mu_d$. We have focussed on improvement 
in estimation of the MSE for each district separately. 
In practice, improvement of estimation for some districts is 
more important than for others. Many surveys are designed 
for inferences other than small-area estimation, or take small 
areas into account in planning only peripherally, and so they 
may yield more than satisfactory estimators for some 
districts, typically the most populous ones, and less satis- 
factory for others, often the sparsely populated districts. In 
such a setting, relatively higher inferential priority should be 
ascribed to the latter districts. Shrinkage estimators of small- 
area means and proportions have this property, and the 
simulations documented in Section 4 indicate that 
composite estimation of MSE has a similar property, at least 
in relation to the averaged estimator. 

For a given size of the bias in estimating an MSE, we 
prefer the positive bias, because we regard overstating the 
precision as statistically ‘dishonest’, whereas understating it 
merely fails to present the estimate in the light it deserves; 
we undersell the results of our analytical effort. With this 
perspective, the optimal coefficient $c_d$ in (7) should not be 
sought by minimising the MSE of the combination, but by a 
criterion that regards underestimation of MSE as an error 
more severe than its overestimation by the same amount. 
Finding a suitable criterion for this, for which optimisation 
is tractable, is an open problem. The composite MSE 
estimator derived in Section 3 tends to overestimate the 
MSE, but this is not by our design. 

We have experimented with ML and REML estimation; 
in the setting used for the simulations, the differences 
between the two approaches are minute. The advantage of 
unbiased estimation of the variance $\sigma^2_B$ is lost when $\hat\sigma^2_B$ is 
subjected to a non-linear transformation, and efficiency is 
maintained by transformations only asymptotically. However, 
small-area estimation is a quintessentially small-sample 
problem. 

The approach presented in this paper illustrates the 
universality of the general idea of combining alternative 
estimators. The composite estimator exploits the strengths 
and reduces the drawbacks of the constituent estimators. 
Applying it is not detrimental when one of the estimators is 
far inferior to the other. As a form of averaging is involved 
even in the composite MSE estimator, it contributes to its 
robustness by ameliorating departures from the assumptions 
made in the theoretical development, such as 
heteroscedasticity and asymmetric (non-normal) within-district 
distributions. 

Incorporating inferential priorities, in effect, 
redistributing the precision in estimating the MSEs for the 
small areas, is an open problem. A similar problem, 
designing surveys for small-area estimation so as to ensure 
sufficient precision in the model-based perspective (with 
averaging) is addressed by Longford (2006). 


Handling survey nonresponse in cluster sampling 


Jun Shao 


Abstract 


In surveys under cluster sampling, nonresponse on a variable is often dependent on a cluster level random effect and, hence, 
is nonignorable. Estimators of the population mean obtained by mean imputation or reweighting under the ignorable 
nonresponse assumption are then biased. We propose an unbiased estimator of the population mean by imputing or 
reweighting within each sampled cluster or a group of sampled clusters sharing some common feature. Some simulation 
results are presented to study the performance of the proposed estimator. 


Key Words: Nonignorable nonresponse; Random-effect-based nonresponse; Imputation; Collapsing clusters. 


1. Introduction 


Nonresponse exists in most survey problems. The proba- 
bility of having a nonrespondent in a survey item (variable) 
y typically depends on the unobserved value of y, which 
creates a great challenge in handling nonrespondents. Com- 
monly used procedures for handling nonresponse (such as 
reweighting and imputation) are all based on the assumption 
that nonresponse is ignorable conditional on an auxiliary 
variable. More precisely, 


$$P(y \text{ is a respondent} \mid y, z) = P(y \text{ is a respondent} \mid z), \tag{1}$$


where z is an auxiliary variable whose values are observed 
for all sampled units in the survey. That is, conditional on z, 
the value of y and its response status are statistically 
independent. Assumption (1) is referred to as the unconfounded 
response mechanism by Lee, Rancourt and Särndal 
(1994). Using the terminology in Rubin (1976), nonresponse 
under (1) is ignorable conditional on $z$. 

There are situations in which it is difficult to find a 
variable z to satisfy (1). The purpose of this article is to 
study a method of handling nonresponse when cluster 
sampling is used, assuming that a variable z satisfying (1) is 
not available. In cluster sampling, sampling is carried out in 
two stages; the first stage sampled units are clusters 
containing units that are sampled in the second stage. 
Cluster sampling is used because of economic consider- 
ations. It is necessary when no reliable list of the second 
stage units in the population is available (for example, there 
is no complete list of people but a list of households is 
available). Under cluster sampling, the variable of interest $y$ 
may be decomposed as $y = \mu + b + e$, where $\mu$ is an 
unknown overall mean of $y$, $b$ is a cluster level random 
effect (all units in the same cluster share the same random 
effect $b$), and $e$ is a within-cluster random effect. In many 
cases, the dependence of the value of $y$ and its response 
status is through the unobserved cluster level random effect 
$b$: 

$$P(y \text{ is a respondent} \mid y, b) = P(y \text{ is a respondent} \mid b), \tag{2}$$

i.e., if $b$ were observed, then we would have assumption (1) 
with $z = b$. For example, suppose that clusters are households 
and a single person completes survey forms for all 
sampled persons in a household. It is likely that the response 
probability depends on the household level variable $b$, not 
on the within household variable $e$. 

Assumption (2) was first used by Wu and Carroll (1988) 
in a health problem where the clusters have a longitudinal 
(repeated-measure) structure. They called (2) informative 
censoring (missing) and proposed a method under some 
parametric assumptions on the probability $P(y$ is a 
respondent $\mid b)$ and the distribution of $y$. Later, Little 
(1995) called this type of missing mechanism the nonignorable 
random-coefficient-based missing mechanism. 
Thus, assumption (2) will be referred to as the nonignorable 
random-effect-based response mechanism. Since $b$ is not 
observed, response mechanism (2) is actually nonignorable. 

For survey data, it is difficult to impose any parametric 
model on the distribution of y. Furthermore, it is also 
difficult to fit a parametric model for the response mech- 
anism under (2), since b is not observed. After introducing 
some details on the sampling design and our assumptions, 
we propose in Section 2 a method for the estimation of the 
population mean of y under response mechanism (2), 
without requiring a parametric model for the response 
mechanism. It is assumed that y follows a random (cluster) 
effect model, but there is no parametric assumption on the 
distribution of y. Results from a simulation study are 
presented in Section 3 for examining the performance of the 
proposed estimator. Some discussions are given in the last 
section. 


2. Main results 


Let $S$ be a sample of clusters of size $n$ from a population 
$P$. Within the $i$th sampled cluster, let $S_i$ be the second 
stage sample of size $m_i \ge 2$ from a population $P_i$. For 
sampled unit $j \in S_i$, a survey weight $w_{ij}$ is constructed 
(from the specification of the sampling design) so that when 
there is no nonresponse, $\hat Y = \sum_{i \in S}\sum_{j \in S_i} w_{ij} y_{ij}$ is an unbiased 
estimator of the population total $Y$ on any variable $y$, i.e., 
$E_s(\hat Y - Y) = 0$, where $y_{ij}$ is the $y$-value of unit $j$ in cluster $i$, 
$Y = \sum_{i \in P}\sum_{j \in P_i} y_{ij}$, and $E_s$ is the expectation with respect to 
repeated sampling. 

Let $y$ be the variable of interest. We adopt an imputation 
model approach, i.e., we assume that each $y_{ij}$ in the 
population is a random variable with 

$$y_{ij} = \mu_i + b_i + e_{ij}, \tag{3}$$

where $\mu_i$ is an unknown parameter, $b_i$ is an unobserved 
cluster level random effect with mean 0 and a finite 
variance, $e_{ij}$ is an unobserved within cluster random effect 
with mean 0 and a finite variance, and $b_i$’s and $e_{ij}$’s are 
independent. Note that the distribution of $y_{ij}$ may vary with 
$(i, j)$. 

Let $\delta_{ij}$ be the response indicator for $y_{ij}$ ($\delta_{ij} = 1$ if $y_{ij}$ is 
a respondent and $\delta_{ij} = 0$ if $y_{ij}$ is a nonrespondent). We 
adopt the approach in Shao and Steel (1999), i.e., $\delta_{ij}$ is 
defined for every unit in the population and the nonresponse 
mechanism is part of the model. Let $\delta_i$ be the vector 
containing $\delta_{ij}$, $j \in S_i$, and $y_i$ be the vector containing 
$y_{ij}$, $j \in S_i$. We assume the following nonignorable random-effect-based 
response mechanism: for every sample, 

$$P_m(\delta_i \mid y_i, b_i) = P_m(\delta_i \mid b_i), \quad i \in S, \tag{4}$$

where $P_m$ is the probability with respect to the model and 
$P_m(\xi \mid \eta)$ denotes the conditional distribution of $\xi$ given $\eta$. 
That is, conditional on $b_i$, $y_i$ and $\delta_i$ are independent. 
(Unconditionally, they may be dependent.) We assume that 
the stochastic mechanism with respect to the model is 
independent of the sampling mechanism so that 
$E_s E_m(X) = E_m E_s(X)$ as long as $X$ is integrable, where $E_m$ 
is the expectation with respect to $P_m$. 
Furthermore, we assume that 


for any $i \in S$, at least one $\delta_{ij}$ is 1. (5) 


That is, each cluster has at least one respondent. Without 
this assumption (or some other assumption), the population 
total Y may not be estimable. More discussion is given in 
Section 4. 

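If we assume ignorable nonresponse, i.e., 
$P_m(\delta_{ij} = 1 \mid y_{ij}) = P_m(\delta_{ij} = 1)$, then a commonly used procedure is to 
impute each nonrespondent by the mean 
$\sum_{i \in S}\sum_{j \in S_i} \delta_{ij} w_{ij} y_{ij} / \sum_{i \in S}\sum_{j \in S_i} \delta_{ij} w_{ij}$, 
which leads to the following estimator of $Y$: 

$$\hat Y_I = \sum_{i \in S}\sum_{j \in S_i} \delta_{ij}\tilde w_{ij} y_{ij}, \qquad \tilde w_{ij} = w_{ij}\,\frac{\sum_{i' \in S}\sum_{j' \in S_{i'}} w_{i'j'}}{\sum_{i' \in S}\sum_{j' \in S_{i'}} \delta_{i'j'} w_{i'j'}}. \tag{6}$$

Under assumptions (3)-(5), 

$$E_s E_m(\hat Y_I) = E_s E_m\Big[\sum_{i \in S}\sum_{j \in S_i} \delta_{ij}\tilde w_{ij}(\mu_i + b_i + e_{ij})\Big] = E_s E_m\Big[\sum_{i \in S}\sum_{j \in S_i} \delta_{ij}\tilde w_{ij}\mu_i\Big] + E_s E_m\Big[\sum_{i \in S}\sum_{j \in S_i} \delta_{ij}\tilde w_{ij} b_i\Big], \tag{7}$$

where the last equality follows from 

$$E_m(\delta_{ij}\tilde w_{ij} e_{ij}) = E_m[E_m(\delta_{ij}\tilde w_{ij} e_{ij} \mid b_i)] = E_m[E_m(\delta_{ij}\tilde w_{ij} \mid b_i)\,E_m(e_{ij} \mid b_i)] = 0 \tag{8}$$

under (4). The first term in (7) is equal to 

$$E_s E_m\left[\frac{\big(\sum_{i \in S}\sum_{j \in S_i} \delta_{ij} w_{ij}\mu_i\big)\big(\sum_{i \in S}\sum_{j \in S_i} w_{ij}\big)}{\sum_{i \in S}\sum_{j \in S_i} \delta_{ij} w_{ij}}\right],$$

which is approximately equal to (when $n$ is large) 

$$E_s\left[\frac{E_m\big(\sum_{i \in S}\sum_{j \in S_i} \delta_{ij} w_{ij}\mu_i\big)\sum_{i \in S}\sum_{j \in S_i} w_{ij}}{E_m\big(\sum_{i \in S}\sum_{j \in S_i} \delta_{ij} w_{ij}\big)}\right] = E_s\left[\frac{\big(\sum_{i \in S}\sum_{j \in S_i} w_{ij}\mu_i E_m(\delta_{ij})\big)\big(\sum_{i \in S}\sum_{j \in S_i} w_{ij}\big)}{\sum_{i \in S}\sum_{j \in S_i} w_{ij} E_m(\delta_{ij})}\right].$$

Note that 

$$E_s E_m(Y) = \sum_{i \in P}\sum_{j \in P_i} \mu_i = E_s\Big[\sum_{i \in S}\sum_{j \in S_i} w_{ij}\mu_i\Big].$$

Hence, either $\mu_i = \mu$ for all $i$ or $E_m(\delta_{ij})$ does not 
depend on $(i, j)$ implies that the expectation of the first 
term in (7) is approximately equal to the expectation of $Y$. 
However, $E_m(\delta_{ij}\tilde w_{ij} b_i) \ne 0$ in general, because $\delta_{ij}$ and $b_i$ 
are dependent. Thus, the second term in (7) is not 0 and, 
hence, $\hat Y_I$ defined by (6) is biased under the nonignorable 
random-effect-based nonresponse. This bias does not go 
away asymptotically as $n \to \infty$ and/or $m_i \to \infty$ for all $i$. 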
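
The estimator obtained by imputing one overall respondent mean can be written compactly in code. The Python sketch below is illustrative only: the data layout (a list of clusters, each a list of `(weight, y, responded)` tuples), the function name and the toy numbers are assumptions made for this example.

```python
def total_global_mean_imputation(sample):
    """Estimate a population total by imputing one overall respondent mean.

    `sample` is a list of clusters; each cluster is a list of
    (weight, y, responded) tuples, and y is ignored when responded is False.
    """
    sum_w = sum(w for cluster in sample for w, _, _ in cluster)
    sum_dw = sum(w for cluster in sample for w, _, d in cluster if d)
    sum_dwy = sum(w * y for cluster in sample for w, y, d in cluster if d)
    if sum_dw == 0:
        raise ValueError("no respondents in the sample")
    # Every nonrespondent receives the same weighted respondent mean,
    # so the estimator is (sum of weights) * (weighted respondent mean).
    return sum_w * sum_dwy / sum_dw


# Hypothetical sample: the second cluster has low values and one nonrespondent.
sample = [
    [(10.0, 50.0, True), (10.0, 54.0, True)],
    [(10.0, 20.0, True), (10.0, None, False)],
]
print(total_global_mean_imputation(sample))
```

In this toy sample the nonrespondent sits in the cluster with low values but is imputed with a mean dominated by the other cluster, which is the source of the bias described above.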

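Recognizing that the problem with $\hat Y_I$ is that imputation 
is done over the entire sample whereas the nonresponse 
depends on a cluster level random effect, we can find an 
unbiased estimator by performing imputation within each 
cluster. This would have been a natural way of imputing if 
the cluster random effect $b_i$ were observed. If we impute a 
nonrespondent $y_{ij}$ in cluster $i$ by the cluster mean 
$\sum_{j \in S_i} \delta_{ij} w_{ij} y_{ij} / \sum_{j \in S_i} \delta_{ij} w_{ij}$, then the resulting estimator is 

$$\hat Y_c = \sum_{i \in S}\sum_{j \in S_i} \delta_{ij}\bar w_{ij} y_{ij}, \quad \text{with} \quad \bar w_{ij} = w_{ij}\,\frac{\sum_{j' \in S_i} w_{ij'}}{\sum_{j' \in S_i} \delta_{ij'} w_{ij'}}. \tag{9}$$

Assumption (5) ensures that $\bar w_{ij}$ is well defined. Note that 

$$\begin{aligned}
E_s E_m(\hat Y_c) &= E_s E_m\Big[\sum_{i \in S}\sum_{j \in S_i} \delta_{ij}\bar w_{ij}\mu_i\Big] + E_s E_m\Big[\sum_{i \in S}\sum_{j \in S_i} \delta_{ij}\bar w_{ij} b_i\Big] \\
&= E_s E_m\Big[\sum_{i \in S}\sum_{j \in S_i} w_{ij}\mu_i\Big] + E_s E_m\Big[\sum_{i \in S}\sum_{j \in S_i} w_{ij} b_i\Big] \\
&= E_s E_m(Y),
\end{aligned}$$

where the first equality follows from assumption (3) and the 
fact that, under assumption (4), result (8) still holds with $\tilde w_{ij}$ 
replaced by $\bar w_{ij}$, the second equality follows from the 
definition of $\bar w_{ij}$ and the fact that $\mu_i$ and $b_i$ do not depend 
on $j$, and the last equality follows from $E_m(b_i) = 0$. Hence, 
$\hat Y_c$ is an unbiased estimator of $Y$. 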
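
Imputation within each cluster can be sketched in the same way. The Python code below is a minimal illustration, not the author's implementation; the data layout (clusters as lists of `(weight, y, responded)` tuples), the function name and the toy numbers are assumptions made for this example.

```python
def total_within_cluster_imputation(sample):
    """Estimate a population total by imputing each cluster's own respondent mean.

    `sample` is a list of clusters; each cluster is a list of
    (weight, y, responded) tuples, and y is ignored when responded is False.
    """
    total = 0.0
    for cluster in sample:
        sum_w = sum(w for w, _, _ in cluster)
        sum_dw = sum(w for w, _, d in cluster if d)
        if sum_dw == 0:
            # Assumption (5): every sampled cluster needs a respondent.
            raise ValueError("cluster without any respondent")
        sum_dwy = sum(w * y for w, y, d in cluster if d)
        # Cluster contribution: (sum of weights) * (weighted respondent mean).
        total += sum_w * sum_dwy / sum_dw
    return total


# Hypothetical sample: the second cluster has low values and one nonrespondent.
sample = [
    [(10.0, 50.0, True), (10.0, 54.0, True)],
    [(10.0, 20.0, True), (10.0, None, False)],
]
print(total_within_cluster_imputation(sample))
```

Here the nonrespondent is imputed with the value observed in its own cluster, so the low-valued cluster is no longer pulled towards the other one.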

Since imputation is done within each cluster, the estimator 
defined by (9) seems inefficient when some cluster 
sample sizes $m_i$ are very small. This worry, however, is not 
necessary in the case where $w_{ij} = w_i$ for all $j$ (e.g., the 
second stage sampling is with equal probability). When 
$w_{ij} = w_i$ for all $j$, imputation leading to $\hat Y_c$ in (9) is actually 
done in a much larger class, a group of clusters sharing 
something in common. Let $\bar\delta_i = m_i^{-1}\sum_{j \in S_i} \delta_{ij}$ be the 
response rate within cluster $i$ and let 

$$G_l = \{i \in S : m_i = m,\ \bar\delta_i = k/m\}, \quad l = (k, m),\ k \le m. \tag{10}$$

For each $l = (k, m)$, $G_l$ in (10) is the group of sample 
clusters having the same $m_i = m$ and $\bar\delta_i = k/m$. If $w_{ij} = w_i$ 
for all $j$, then, for $i \in G_l$ with $l = (k, m)$, 

$$\begin{aligned}
\bar w_{ij} &= w_i \sum_{j \in S_i} w_i \Big/ \sum_{j \in S_i} \delta_{ij} w_i \\
&= w_i / \bar\delta_i \\
&= w_i / (k/m) \\
&= w_i \sum_{i \in G_l} m\,w_i \Big/ \sum_{i \in G_l} k\,w_i \\
&= w_i \sum_{i \in G_l}\sum_{j \in S_i} w_i \Big/ \sum_{i \in G_l}\sum_{j \in S_i} \delta_{ij} w_i .
\end{aligned}$$

Therefore, imputation leading to $\hat Y_c$ in (9) is actually done 
within each group $G_l$ when $w_{ij} = w_i$ for all $j$, i.e., a 
nonrespondent in $S_i$ is imputed by the sample mean of the 
respondents in $G_l$, 
$\sum_{i \in G_l}\sum_{j \in S_i} \delta_{ij} w_{ij} y_{ij} / \sum_{i \in G_l}\sum_{j \in S_i} \delta_{ij} w_{ij}$. 

When $w_{ij}$ varies with $j$ for some $i$’s, some additional 
conditions are needed in order to combine clusters. A 
discussion is given in Section 4. 
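
The equivalence between imputing within clusters and imputing within groups of clusters with the same size and the same number of respondents can be checked numerically. The Python sketch below assumes weights that are constant within each cluster; the data layout, the function names and the toy numbers are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def _imputed_total(units):
    """(sum of weights) * (weighted respondent mean) for a list of units."""
    sum_w = sum(w for w, _, _ in units)
    sum_dw = sum(w for w, _, d in units if d)
    sum_dwy = sum(w * y for w, y, d in units if d)
    return sum_w * sum_dwy / sum_dw


def total_within_cluster(sample):
    """Impute each cluster's own respondent mean."""
    return sum(_imputed_total(cluster) for cluster in sample)


def total_within_group(sample):
    """Impute the respondent mean of the group of clusters sharing (k, m)."""
    groups = defaultdict(list)
    for cluster in sample:
        m = len(cluster)                          # cluster sample size
        k = sum(1 for _, _, d in cluster if d)    # number of respondents
        groups[(k, m)].extend(cluster)
    return sum(_imputed_total(units) for units in groups.values())


# Hypothetical sample; weights differ between clusters but not within them.
sample = [
    [(5.0, 10.0, True), (5.0, 12.0, True), (5.0, None, False)],
    [(8.0, 20.0, True), (8.0, None, False), (8.0, 26.0, True)],
    [(5.0, 7.0, True), (5.0, 9.0, True), (5.0, 8.0, True)],
]
print(total_within_cluster(sample), total_within_group(sample))
```

The two totals coincide, although the individual imputed values differ; with weights that vary within clusters the two functions generally give different answers.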

We end this section with a discussion of variance 
estimation, since most surveys require a variance estimator 
for each point estimator. A variance formula or its 
approximation (as $n \to \infty$) for $\hat Y_c$ can be derived, which 
may require more details on the sampling design. When the 
first stage sample size $n$ is large, $m_i \le m$ for all $i$ and a fixed 
integer $m$, and $n/N$ is small, where $N$ is the size of $P$, we 
can apply the adjusted jackknife method as described in Rao 
and Shao (1992). More precisely, we can follow the 
following steps. 


1. Create $n$ jackknife replicates, where the $i$th 
replicate is obtained by deleting the $i$th cluster and 
adjusting the weights to $w_{kj}^{(i)}$, $k \ne i$, $i = 1, \ldots, n$, 
according to the sampling design. For example, if 
the first stage sampling is a stratified sampling, 
then $w_{kj}^{(i)} = w_{kj}$ if $k$ and $i$ are not in the same 
stratum and $w_{kj}^{(i)} = n_h w_{kj}/(n_h - 1)$ if $k$ and $i$ are in 
the same stratum $h$, where $n_h$ is the stratum size. 


2. Re-impute the nonrespondents in the $i$th jackknife 
replicate using the respondents in the $i$th jackknife 
replicate, $i = 1, \ldots, n$. 


3. Compute $\hat Y_c^{(i)}$, the same as $\hat Y_c$ but based on the $i$th 
re-imputed jackknife replicate, $i = 1, \ldots, n$. 


4. Compute the jackknife variance estimator for $\hat Y_c$ 
using a standard jackknife formula (e.g., Shao and 
Tu 1995, Chapter 6). For example, if the first stage 
sampling is a stratified sampling with $H$ strata, then 
a jackknife variance estimator is 

$$v = \sum_{h=1}^{H} \frac{n_h - 1}{n_h} \sum_{i \in S_h} \Big(\hat Y_c^{(i)} - \frac{1}{n_h}\sum_{k \in S_h} \hat Y_c^{(k)}\Big)^2,$$

where $S_h$ is the sample from the $h$th stratum and 
$n_h$ is the size of $S_h$. 

3. Simulation results 


We now present some results from a simulation study to 
examine the performance of the estimators $\hat Y_I$ and $\hat Y_c$. 

We create a finite population similar to the elementary 
school teacher population in Maricopa County, Arizona 
(Lohr 1999, pages 446-447). The finite population contains 
311 clusters (schools). In each cluster, the second stage units 
are teachers. The cluster size (the number of teachers) varies 
from 6 to 59 and, hence, the first stage sampling is an 
unequal probability sampling with probability proportional 
to cluster size. The first stage sampling is with replacement 
and the sample size is 31. The second stage sampling is a 
simple random sampling of size 6 (for any cluster) without 
replacement. 

For each teacher, the variable of interest is the minutes 
spent per week in school on preparation. The values of $y_{ij}$ 
for this variable in the simulation are generated according to 
model (3), where $\mu_i$ is the mean minutes spent per week in 
school on preparation for the $i$th school, $b_i$ is a random 
effect of the $i$th school, and $e_{ij}$ is a random effect of the 
$j$th teacher in the $i$th school. The values of $\mu_i$’s are the 
sample means in the data set in Lohr (1999, pages 446-447), 
which vary from 25.52 to 42.18 with a mean of 33.76 and a 
median of 33.47. The value of $b_i$ is generated according to 
$b_i = 8.31(X_i - 2)$, where $X_i$ has the gamma distribution 
with shape parameter 2 and scale parameter 1. The value of 
$e_{ij}$ is generated from the normal distribution with mean 0 
and standard deviation 2.27. The $b_i$’s and $e_{ij}$’s are 
independently generated. The values of $y_{ij} = \mu_i + b_i + e_{ij}$ 
are generated in each simulation run so that we can evaluate 
the biases and standard errors of estimators using joint 
probability under sampling and models (3)-(5). 

For sampled units, nonrespondents are generated 
according to (4) and (5). That is, each sampled cluster has 
one respondent and the response status of the rest of the 
sampled units in each cluster are independently determined 
by P(y, is missing | 5,) = e’'/(1+e""'). The mean non- 
response probability is 33.76%. 

For the estimation of the finite population mean, a 
simulation of 1,000 runs shows that, when $\hat Y_I$ is used, the 
bias, standard error, and root mean squared error are -2.89, 
1.32, and 3.17, respectively, and the relative bias 
$E(\hat Y_I - Y)/E(Y)$ is -8.5%; when $\hat Y_c$ is used, the bias, 
standard error, and root mean squared error are 0.12, 1.81, 
and 1.82, respectively, and the relative bias $E(\hat Y_c - Y)/E(Y)$ 
is 0.3%. This simulation result supports our theory, i.e., $\hat Y_c$ 
is approximately unbiased but $\hat Y_I$ is biased. In this case, $\hat Y_c$ 
has a larger standard error than $\hat Y_I$, but $\hat Y_I$ has a much larger 
root mean squared error than $\hat Y_c$ due to its large bias. 
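
The qualitative finding can be reproduced with a much simplified model-based simulation. The Python sketch below is not the design used in the paper: it assumes equal cluster sample sizes, equal weights, a common mean for all clusters, and a hypothetical logistic nonresponse probability that increases with the cluster effect (its exact form is an assumption made here). The first unit of every cluster is forced to respond so that assumption (5) holds, and the bias is measured against the model mean.

```python
import math
import random


def simulate(runs=500, n_clusters=31, m=6, mu=33.76, seed=1):
    """Return the simulated biases of the two estimators of the mean."""
    rng = random.Random(seed)
    err_global, err_cluster = [], []
    for _ in range(runs):
        resp_values = []    # all respondent values in this run
        cluster_means = []  # respondent mean of each cluster
        for _ in range(n_clusters):
            x = rng.gammavariate(2.0, 1.0)      # shape 2, scale 1
            b = 8.31 * (x - 2.0)                # cluster random effect
            # Assumed mechanism: nonresponse is more likely for large b.
            p_missing = 1.0 / (1.0 + math.exp(-(x - 2.0)))
            observed = []
            for j in range(m):
                y = mu + b + rng.gauss(0.0, 2.27)
                if j == 0 or rng.random() >= p_missing:
                    observed.append(y)
            resp_values.extend(observed)
            cluster_means.append(sum(observed) / len(observed))
        # Imputation over the whole sample: overall respondent mean.
        err_global.append(sum(resp_values) / len(resp_values) - mu)
        # Imputation within clusters: mean of cluster respondent means.
        err_cluster.append(sum(cluster_means) / n_clusters - mu)
    return sum(err_global) / runs, sum(err_cluster) / runs


bias_global, bias_cluster = simulate()
print(f"bias, imputation over the whole sample: {bias_global:.2f}")
print(f"bias, imputation within clusters:       {bias_cluster:.2f}")
```

Under these assumptions the overall respondent mean is clearly biased downwards, whereas the estimator based on within-cluster imputation has a bias close to zero.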


4. Discussions 


Without the assumption that each sampled cluster has at 
least one respondent, the population total may not be 
estimable unless some other assumption is added. Under the 
nonresponse mechanism (4), when all observations in a 
cluster are nonrespondents, no information in that cluster 
can be recovered from observed data in other clusters unless 
some additional assumption is made. For example, one may 
assume that the population of clusters with no respondent is 
similar to that of clusters with 1 respondent, in which case 
one can collapse clusters by distributing the weights of 
clusters with 0 respondents to the weights of clusters with 1 
respondent. Another approach is to assume a model so that 
we can extrapolate results to clusters with no respondent. 

The results in Section 2 are given for mean imputation. 
Extensions to some other imputation methods are 
straightforward. For example, if random hot deck imputation is 
considered, then our result leads to imputation within 
clusters (or $G_l$’s). When there is a covariate $x$ whose values 
are all observed, our result can be extended to regression 
imputation with model (3) modified to 
$y_{ij} = \alpha + \beta x_{ij} + b_i + e_{ij}$. For unit nonresponse, our result can also be 
applied to re-weighting, i.e., adjusting weights within 
clusters (or $G_l$’s). 

Our method is imputation model based. We assume 
random-effect model (3) and random-effect-based response 
mechanism (4). If model (4) does not hold, then 
$E_m(\delta_{ij}\bar w_{ij} e_{ij}) \ne 0$ and our estimator $\hat Y_c$ has a bias with a 
magnitude depending on the size of $|E_m(\delta_{ij}\bar w_{ij} e_{ij})|$. 
Similarly, $\hat Y_c$ is not valid if model (3) does not hold. 

It is shown in Section 2 that the condition $w_{ij} = w_i$ for all 
$j$ ensures that imputation is done within each $G_l$, that is, the 
group of clusters with the same size and response rate. For 
two-stage sampling, this condition is satisfied when the last 
stage sampling is with equal probability (e.g., simple 
random sampling without replacement). For three-stage 
sampling, model (3) should be replaced by 
$y_{ijk} = \mu_{ij} + b_{ij} + e_{ijk}$, and $b_i$ in (4) should be replaced by $b_{ij}$. The 
survey weight $w_{ijk}$ satisfies $w_{ijk} = w_{ij}$ as long as the last 
stage sampling is with equal probability and our result still 
holds. In two-stage sampling with $w_{ij}$ varying with $j$, we 
may perform imputation within a group of clusters that have 
the same $E_m(y_{ij} \mid \delta_i)$. For example, suppose that, in addition 
to (3)-(5), $\mu_i = \mu$, $b_i$’s are independent and identically 
distributed (iid), and conditional on $b_i$, the components of 
$\delta_i$ are iid. Then $E_m(b_i \mid \delta_i) = E_m(b_i \mid \bar\delta_i)$ depending only on 
the size of the cluster $m_i$ and $\bar\delta_i$. Hence we can perform 
imputation within each $G_l$ defined by (10). 


On an optimal controlled nearest proportional to size sampling scheme 


Neeraj Tiwari, Arun Kumar Nigam and Ila Pant 


Abstract 


The concept of ‘nearest proportional to size sampling designs’ originated by Gabler (1987) is used to obtain an optimal 
controlled sampling design, ensuring zero selection probabilities to non-preferred samples. Variance estimation for the 
proposed optimal controlled sampling design using the Yates-Grundy form of the Horvitz-Thompson estimator is discussed. 
The true sampling variance of the proposed procedure is compared with that of the existing optimal controlled and 
uncontrolled high entropy selection procedures. The utility of the proposed procedure is demonstrated with the help of 
examples. 


Key Words: Controlled sampling; Non-preferred samples; Quadratic programming; High entropy variance. 


1. Introduction 


In many situations, some samples may be undesirable due 
to administrative inconvenience, long distance, similarity of 
units or cost considerations. Such samples are termed non- 
preferred samples and the technique for avoiding these 
samples is known as ‘controlled selection’ or ‘controlled 
sampling’. This technique, originated by Goodman and Kish 
(1950) has received considerable attention in recent years 
due to its practical importance. 

The technique of controlled sampling is most appropriate 
when financial or other considerations make it necessary to 
select a small number of large first stage units, such as 
hospitals, firms, schools etc., for inclusion in the study. The 
main purpose of controlled sampling is to increase the 
probability of selecting a preferred combination beyond that 
possible with stratified sampling, whilst simultaneously 
maintaining the initial selection probabilities for each unit of 
the population, thus preserving the property of a probability 
sample. This situation generally arises in field surveys 
where the practical considerations make selection of some 
units undesirable but it is necessary to follow probability 
sampling. Controls may be imposed to secure a proper 
distribution geographically or otherwise and to ensure 
adequate sample size for some subgroups of the population. 
Goodman and Kish (1950) considered the reduction of 
sampling variances of the key estimates as the principal 
objective of controlled selection, but they also cautioned that 
this might not always be attained. A real problem 
emphasizing the need for controls beyond stratification was 
also discussed by Goodman and Kish (1950, page 354) with 
the objective of selecting 21 primary sampling units to
represent the North Central States. Hess and Srikantan 
(1966) used the data for the 1961 universe of nonfederal, 
short-term general medical hospitals in the United States to 
illustrate the applications of estimation and variance
formulae for controlled selection. Waterton (1983) used the
data available from a postal survey of Scottish school
leavers carried out in 1977 to describe the advantages of
controlled selection and compare the efficiency of controlled
selection with multiple proportionate stratified random
sampling (meaning the sampling scheme in which,
instead of one stratifying variable, many variables, each of
which is associated with the variable of interest $y$, are used
by cross-classifying the population on the basis of these
variables) and found the controlled selection to perform
favourably.

Three different approaches have been advanced in the 
recent literature to implement controlled sampling. These 
are (i) using typical experimental design configurations, (ii)
the method of emptying boxes and (iii) using linear
programming approaches. While some researchers have
used simple random sampling designs to construct the controlled
sampling designs, one of the more popular strategies
is the use of IPPS (inclusion probability proportional to size) 
sampling designs in conjunction with the Horvitz- 
Thompson (1952) estimator. To construct controlled simple 
random sampling designs, Chakrabarti (1963) and Avadhani 
and Sukhatme (1973) proposed the use of balanced 
incomplete block (BIB) designs with parameters $v = N$,
$k = n$ and $\lambda$, where $N$ is the population size and $n$ is the
sample size. Wynn (1977) and Foody and Hedayat (1977) 
used the BIB designs with repeated blocks for situations 
where non-trivial BIB designs do not exist. Gupta, Nigam, 
and Kumar (1982) studied controlled sampling designs with 
inclusion probabilities proportional to size and used BIB 
designs in conjunction with the Horvitz-Thompson
estimator of the population total $Y$ ($= \sum_{i=1}^{N} y_i$, where $y_i$ is
the value of the $i$th unit of the population, $U$). Nigam,
Kumar and Gupta (1984) used some configurations of
different types of experimental designs, including BIB
designs, to obtain controlled IPPS sampling plans with the
additional property $c\pi_i\pi_j \le \pi_{ij} \le \pi_i\pi_j$ for all $i \ne j = 1, ..., N$
and some positive constant $c$ such that $0 < c < 1$,
where $\pi_i$ and $\pi_{ij}$ denote first and second order inclusion
probabilities, respectively. Hedayat and Lin (1980) and
Hedayat, Lin, and Stufken (1989) used the method of
‘emptying boxes’ to construct controlled IPPS sampling
designs with the additional property $0 < \pi_{ij} \le \pi_i\pi_j$, $i < j = 1, ..., N$.
Srivastava and Saleh (1985) and Mukhopadhyay
and Vijayan (1996) suggested the use of ‘$t$-designs’ to
replace simple random sampling without replacement 
(SRSWOR) designs to construct controlled sampling 
designs. 

All the methods of controlled sampling discussed in the 
previous paragraph may be carried out manually with 
varying degrees of laboriousness, but none has exploited the 
advantage of modern computing. Using the simplex method 
in linear programming, Rao and Nigam (1990, 1992) 
proposed optimal controlled sampling designs that minimize 
the probability of selecting the non-preferred samples, while 
retaining certain properties of an associated uncontrolled 
plan. Utilizing the approach of Rao and Nigam (1990, 
1992), Sitter and Skinner (1994) and Tiwari and Nigam 
(1998) used the simplex method in linear programming to 
solve multi-way stratification problems with ‘controls 
beyond stratification’. 

In the present article, we use quadratic programming to 
propose an optimal controlled sampling design which 
ensures that the probability of selecting non-preferred 
samples is exactly equal to zero, rather than minimizing it, 
without sacrificing the efficiency of the Horvitz-Thompson 
estimator based on an associated uncontrolled IPPS 
sampling plan. The idea of ‘nearest proportional to size 
sampling designs’, introduced by Gabler (1987), is used to 
construct the proposed design. The Microsoft Excel Solver 
of the Microsoft Office 2000 package is used to solve the 
quadratic programming problem. The applicability of the 
Horvitz-Thompson estimator to the proposed design is 
discussed. The true sampling variance of the estimate for the 
proposed design is empirically compared with the variances 
of the alternative optimal controlled designs of Rao and 
Nigam (1990, 1992) and uncontrolled high entropy 
selection procedures of Goodman and Kish (1950) and 
Brewer and Donadio (2003). In Section 3, some examples 
are considered to demonstrate the utility of the proposed 
procedure by comparing the probabilities of non-preferred 
samples and sampling variances of the estimates. Finally in 
Section 4, the findings of the paper are summarized. 


2. The optimal controlled sampling design 


In this section, we use the concept of ‘nearest proportional
to size sampling designs’ to propose an optimal
controlled IPPS sampling design that matches the original
$\pi_i$ values, satisfies the sufficient condition $\pi_{ij} \le \pi_i\pi_j$ for
non-negativity of the Yates-Grundy (1953) form of the
Horvitz-Thompson (HT) (1952) estimator of the variance
and also ensures that the probability of selecting non-preferred
samples is exactly equal to zero. Before coming to
the proposed plan, we briefly describe the Midzuno-Sen
and Sampford IPPS designs which will be used in the
proposed plan for obtaining the initial IPPS design $p(s)$.


2.1 The Midzuno-Sen and Sampford IPPS designs 


To introduce the concept of IPPS designs, we assume
that a known positive quantity, $x_i$, is associated with the
value of the $i$th unit of the population and there is reason to
believe that the $y_i$’s are approximately proportional to the $x_i$’s.
Here $x_i$ is assumed to be known for all units of the
population and $y_i$ is to be collected for sampled units. In
IPPS sampling designs, $\pi_i$, the probability of including the
$i$th unit in a sample of size $n$, is $np_i$, where $p_i$ is the
single draw probability of selecting the $i$th unit in the
population (also known as the normal size measure of unit
$i$), given by

$$p_i = \frac{x_i}{\sum_{j=1}^{N} x_j}, \quad i = 1, 2, ..., N.$$


We first describe the Midzuno-Sen IPPS scheme and then 
discuss Sampford’s design. 

The Midzuno-Sen (MS) (1952, 1953) scheme has a
restriction that the probabilities of selecting the $i$th unit in
the population ($p_i$’s) must satisfy the condition

$$\frac{1}{n} \cdot \frac{n-1}{N-1} \le p_i \le \frac{1}{n}, \quad i = 1, 2, ..., N. \qquad (1)$$

If (1) is satisfied for the $p_i$ values of the population
under consideration, we apply the MS scheme to get an
IPPS plan with the revised probabilities of selection, $p_i^*$’s
(also known as revised normal size measures), given by

$$p_i^* = n p_i \frac{N-1}{N-n} - \frac{n-1}{N-n}, \quad i = 1, 2, ..., N. \qquad (2)$$


Now, supposing that the $s$th sample consists of units
$i_1, i_2, ..., i_n$, the probability of including these units in the
$s$th sample under the MS scheme is given by

$$p(s) = \pi_{i_1 i_2 \ldots i_n} = \frac{1}{\binom{N-1}{n-1}} \left( p_{i_1}^* + p_{i_2}^* + \cdots + p_{i_n}^* \right). \qquad (3)$$


However, due to restriction (1), the MS plan limits the 
applicability of the method to units that are rather similar in
size. Therefore, when the initial probabilities do not satisfy
the condition of the MS plan, we suggest the use of 
Sampford’s (1967) plan to obtain the initial IPPS design 
p(s). 
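
As a rough illustration of equations (1) to (3), the sketch below enumerates a Midzuno-Sen design in Python. It assumes that condition (1) reads $(n-1)/\{n(N-1)\} \le p_i \le 1/n$ and that (2) takes the form $p_i^* = np_i(N-1)/(N-n) - (n-1)/(N-n)$. The function name `midzuno_sen_design` and the dictionary layout are illustrative choices and do not come from the paper; the $p_i$ values are those of Example 1(a) in Section 3.

```python
from itertools import combinations
from math import comb


def midzuno_sen_design(p, n):
    """Return {sample: probability} for the Midzuno-Sen IPPS scheme.

    p : list of single-draw probabilities p_i (assumed to sum to 1)
    n : sample size
    Samples are tuples of 1-based unit labels (a hypothetical layout).
    """
    N = len(p)
    lower = (n - 1) / (n * (N - 1))
    if not all(lower <= p_i <= 1 / n for p_i in p):
        raise ValueError("condition (1) fails; the MS scheme is not applicable")
    # Revised normal size measures, equation (2) as assumed above.
    p_star = [n * p_i * (N - 1) / (N - n) - (n - 1) / (N - n) for p_i in p]
    # Sample probabilities, equation (3).
    denom = comb(N - 1, n - 1)
    return {
        s: sum(p_star[i - 1] for i in s) / denom
        for s in combinations(range(1, N + 1), n)
    }


p = [0.14, 0.14, 0.15, 0.16, 0.22, 0.19]
design = midzuno_sen_design(p, 3)
# First order inclusion probability of unit 1: sum over samples containing it.
pi_1 = sum(prob for s, prob in design.items() if 1 in s)
```

Summing the probabilities of all samples that contain unit 1 should return $\pi_1 = np_1 = 0.42$ up to rounding, which gives a quick check that the enumerated design is IPPS.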

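Using Sampford’s scheme, the probability of including
$n$ units $i_1, i_2, ..., i_n$ in the $s$th sample is given by

$$p(s) = \pi_{i_1 i_2 \ldots i_n} = n K_n \lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_n} \left( 1 - \sum_{u=1}^{n} p_{i_u} \right), \qquad (4)$$

where $K_n = \left( \sum_{t=1}^{n} t L_{n-t} / n^t \right)^{-1}$, $\lambda_i = p_i / (1 - n p_i)$ and, for a set
$S(m)$ of $m \le N$ different units $i_1, i_2, ..., i_m$, $L_m$ is
defined as

$$L_0 = 1, \quad L_m = \sum_{S(m)} \lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_m} \quad (1 \le m \le N).$$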
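
The Sampford probabilities in (4) can be enumerated in the same way. The sketch below is only illustrative: it assumes $\lambda_i = p_i/(1 - np_i)$, which is the form under which the probabilities in (4) sum to one with $K_n$ as defined above, and it needs Python 3.8 or later for `math.prod`. The function name `sampford_design` is hypothetical; the $p_i$ values are those of Example 1(b) in Section 3.

```python
from itertools import combinations
from math import prod


def sampford_design(p, n):
    """Return {sample: probability} following equation (4).

    Assumes lambda_i = p_i / (1 - n * p_i), so every n * p_i must be below 1.
    """
    N = len(p)
    if not all(n * p_i < 1 for p_i in p):
        raise ValueError("Sampford's scheme needs n * p_i < 1 for every unit")
    lam = [p_i / (1 - n * p_i) for p_i in p]
    # L[m]: sum over all sets of m distinct units of the product of lambdas.
    L = [1.0] + [
        sum(prod(lam[i] for i in c) for c in combinations(range(N), m))
        for m in range(1, n + 1)
    ]
    K_n = 1.0 / sum(t * L[n - t] / n**t for t in range(1, n + 1))
    return {
        tuple(i + 1 for i in c): n * K_n * prod(lam[i] for i in c)
        * (1 - sum(p[i] for i in c))
        for c in combinations(range(N), n)
    }


p = [0.10, 0.15, 0.10, 0.20, 0.27, 0.18]
design = sampford_design(p, 3)
# First order inclusion probabilities recovered from the enumerated design.
pi = [sum(prob for s, prob in design.items() if i in s) for i in range(1, 7)]
```

The recovered `pi` values should equal $np_i$ (0.30, 0.45, 0.30, 0.60, 0.81, 0.54) up to rounding.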


2.2 The proposed plan 


Consider a population of $N$ units. Suppose a sample of
size $n$ is to be selected from this population. The single
draw selection probabilities of these $N$ units of the
population ($p_i$’s) are known. Let $S$ and $S_1$ denote,
respectively, the set of all possible samples and the set of
non-preferred samples.

Given the selection probabilities for the $N$ units of the
population, we first obtain an appropriate uncontrolled IPPS
design $p(s)$, such as the Midzuno-Sen (1952, 1953) or
Sampford (1967) design, as described in Section 2.1. After
obtaining the initial IPPS design $p(s)$, the idea behind the
proposed plan is to get rid of the non-preferred samples $S_1$
by confining ourselves to the set $S - S_1$ by introducing a
new design $p_0(s)$ which assigns zero probability of
selection to each of the non-preferred samples belonging to
$S_1$, given by

$$p_0(s) = \begin{cases} \dfrac{p(s)}{1 - \sum_{s \in S_1} p(s)} & \text{for } s \in S - S_1, \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$


where $p(s)$ is the initial uncontrolled IPPS sampling plan.
Consequently, $p_0(s)$ is no longer an IPPS design. So,
applying the idea of Gabler (1987), we are interested in the
‘nearest proportional to size sampling design’ $p_1(s)$ in the
sense that $p_1(s)$ minimizes the directed distance $D$ from
the sampling design $p_0(s)$ to the sampling design $p_1(s)$,
defined as

$$D(p_0, p_1) = E_{p_0} \left[ \frac{p_1}{p_0} - 1 \right]^2 = \sum_{s \in S - S_1} \frac{p_1^2(s)}{p_0(s)} - 1, \qquad (6)$$

subject to the following constraints:

(i) $p_1(s) \ge 0$,
(ii) $\sum_{s \in S - S_1} p_1(s) = 1$,
(iii) $\sum_{s \ni i} p_1(s) = \pi_i$,
(iv) $\sum_{s \ni i, j} p_1(s) > 0$, and                (7)
(v) $\sum_{s \ni i, j} p_1(s) \le \pi_i \pi_j$.

The ordering of the above five constraints is carried out
in accordance with their necessity and desirability.
Constraints (i) and (ii) are necessary for any probability
sampling design. Constraint (iii) requires that the
selection probabilities in the old and new schemes remain
unchanged, which ensures that the resultant design will be
IPPS. This is a very strong constraint and it
affects the convergence properties of the proposed plan to a
great extent. Constraint (iv) is highly desirable because it
ensures unbiased estimation of the variance. Constraint (v)
is desirable as it ensures the sufficient condition for non-negativity
of the Yates-Grundy estimator of the variance.
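
To make the role of each constraint concrete, the following sketch checks a candidate design against the five conditions in (7), read here as: (i) non-negative probabilities, (ii) probabilities summing to one, (iii) first order inclusion probabilities equal to the target $\pi_i$, (iv) strictly positive $\pi_{ij}$, and (v) $\pi_{ij} \le \pi_i\pi_j$. The function name, the dictionary layout and the tolerance are assumptions made for illustration only.

```python
from itertools import combinations


def check_constraints(p1, pi, tol=1e-9):
    """Report which of constraints (i)-(v) a candidate design satisfies.

    p1 : {sample tuple: probability} over the preferred samples
    pi : {unit: target first order inclusion probability}
    """
    units = sorted(pi)
    first = {i: sum(q for s, q in p1.items() if i in s) for i in units}
    second = {
        (i, j): sum(q for s, q in p1.items() if i in s and j in s)
        for i, j in combinations(units, 2)
    }
    return {
        "i": all(q >= -tol for q in p1.values()),
        "ii": abs(sum(p1.values()) - 1.0) <= tol,
        "iii": all(abs(first[i] - pi[i]) <= tol for i in units),
        "iv": all(v > tol for v in second.values()),
        "v": all(v <= pi[i] * pi[j] + tol for (i, j), v in second.items()),
    }


# Toy population: N = 4, n = 2, target pi_i = 0.5 for every unit.
target = {i: 0.5 for i in range(1, 5)}
srs = {s: 1 / 6 for s in combinations(range(1, 5), 2)}
# Dropping sample (1, 2) and spreading its mass evenly breaks (iii) and (iv).
trimmed = {s: 0.2 for s in srs if s != (1, 2)}

report_srs = check_constraints(srs, target)
report_trimmed = check_constraints(trimmed, target)
```

In this toy case simple random sampling passes all five checks, while the naively renormalised design fails (iii) and (iv), which is why a further optimisation step is needed after non-preferred samples are removed.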

The solution to the above quadratic programming 
problem, viz., minimizing the objective function (6) subject 
to the constraints (7), provides us with the optimal 
controlled IPPS sampling plan that ensures zero probability 
of selection for the non-preferred samples. The proposed 
plan is as near as possible to the controlled design $p_0(s)$
defined in (5) and at the same time it achieves the same set
of first order inclusion probabilities $\pi_i$ as for the original
uncontrolled IPPS sampling plan $p(s)$. Due to the
constraints (iv) and (v) in (7), the proposed plan also ensures
the conditions $\pi_{ij} > 0$ and $\pi_{ij} \le \pi_i\pi_j$ for the Yates-Grundy
estimator of the variance to be stable and non-negative.
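
The two ingredients of the objective can be sketched in a few lines: the renormalised design $p_0(s)$ of (5) and the directed distance of (6), taken here as $\sum_s p_1^2(s)/p_0(s) - 1$ over the preferred samples. The helper names are hypothetical and the four-unit population is a toy case, not one of the paper's examples. The sketch only evaluates the objective for a given candidate; it does not solve the quadratic programme.

```python
from itertools import combinations


def conditional_design(p, non_preferred):
    """Equation (5): drop non-preferred samples and renormalise."""
    bad = set(non_preferred)
    kept_mass = 1.0 - sum(q for s, q in p.items() if s in bad)
    return {s: q / kept_mass for s, q in p.items() if s not in bad}


def directed_distance(p0, p1):
    """Equation (6): sum of p1(s)^2 / p0(s) over preferred samples, minus 1."""
    return sum(p1.get(s, 0.0) ** 2 / p0[s] for s in p0) - 1.0


# Toy case: N = 4, n = 2, equal probabilities, sample (1, 2) non-preferred.
p = {s: 1 / 6 for s in combinations(range(1, 5), 2)}
p0 = conditional_design(p, [(1, 2)])

# A candidate that keeps every pi_i at 0.5 while avoiding sample (1, 2).
p1 = {(1, 3): 0.25, (1, 4): 0.25, (2, 3): 0.25, (2, 4): 0.25}

d_self = directed_distance(p0, p0)   # distance of p0 from itself
d_cand = directed_distance(p0, p1)   # distance of the candidate from p0
```

Here `d_self` is zero and `d_cand` works out to 0.25, since each of the four retained samples contributes $0.25^2/0.2$. In this toy population the candidate necessarily gives zero probability to the pair (3, 4), so constraint (iv) cannot be met, which mirrors the infeasibility issue discussed later in this section.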

The distance measure $D(p_0, p_1)$ defined in (6) is
similar to the $\chi^2$-statistic often employed in related
problems and is also used by Cassel and Särndal (1972) and
Gabler (1987). Other distance measures are also discussed
by Takeuchi, Yanai and Mukherjee (1983). An alternative
distance measure for the present discussion may be defined
as

$$D(p_0, p_1) = \sum_{s} \frac{(p_0 - p_1)^2}{p_0 + p_1}. \qquad (8)$$

When we applied (8) to the different numerical problems
considered by us, we found that it gave similar
results to (6) in convergence and efficiency and so we will
give results using (6) as the distance measure.

While all the other controlled sampling plans discussed 
by earlier authors attempt to minimize the selection
probabilities of the non-preferred samples, the proposed
plan completely excludes the possibility of selecting non- 
preferred samples by ensuring zero probability for them and 
at the same time it also ensures the non-negativity of the 
Yates-Grundy estimator of the variance. However, in some 
situations a feasible solution to the quadratic programming 
problem, satisfying all the constraints in (7), may not exist. 
Constraint (v) may then be relaxed. This may not guarantee 
the non-negativity of the Yates-Grundy form of the variance 
estimator. However, since the condition $\pi_{ij} \le \pi_i\pi_j$ is
sufficient for non-negativity of the Yates-Grundy estimator
of the variance but not necessary for $n > 2$, as pointed out
by Singh (1954), there will still be a possibility of obtaining
a non-negative estimator of the variance. After relaxing the
constraint (v) in (7), if the Yates-Grundy estimator of the
variance comes out to be negative, an alternative variance 
estimator may be used. This has been demonstrated in 
Example 5 in Section 3. If even after relaxing constraint (v), 
a feasible solution of the quadratic programming problem is 
not found, constraint (iv) may also be relaxed and 
consequently an alternative variance estimator in place of 
the Yates-Grundy form of the HT variance estimator may be 
used. The effect of relaxing these constraints on efficiency 
of the proposed design is difficult to study, as after relaxing 
the non-negativity constraint (v) the Yates-Grundy estimator 
of the variance does not provide accurate results. Using the 
Yates-Grundy estimator of the variance, for some problems 
the variance estimate is smaller after relaxing constraint (v)
[as in the case of Examples 2(a), 2(b) and 3(a) in Section 3]
while for other problems it is larger [as in the case of
Examples 1(a), 1(b), 3(b), 4(a) and 4(b) in Section 3].
Relaxing a constraint leading to an increased variance 
estimate may be due to the inability of the Yates-Grundy 
form of the variance estimator to estimate the true sampling 
variance correctly, when the non-negativity condition is not 
satisfied. 

The proposed method may also be considered superior to 
the earlier methods of optimal controlled selection in the 
sense that setting some samples to have zero selection 
probability is different from associating a cost with each 
sample and then trying to minimize the cost, the technique 
used in earlier approaches of controlled selection. The 
technique employed by the earlier authors for controlled 
selection was a crude approach giving some samples very 
high cost and others very low. 

One limitation of the proposed plan is that it becomes 
impractical when $N$ is very large, as the process of
enumeration of all possible samples and formation of the 
objective function and constraints becomes quite tedious. 
This limitation also holds for the optimum approach of Rao 
and Nigam (1990, 1992) and other controlled sampling 
approaches discussed in Section 1. However, with the 
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advent of faster computing techniques and modern statistical 
packages, there may not be much difficulty in using the 
proposed procedure for moderately large populations. On 
the basis of the size of populations that we have considered 
in the empirical evaluation, we found that the proposed 
method can easily handle the controlled selection problems 
up to a population of 12 units and a sample of size 5. The 
proposed method may be used to select a small number of 
first-stage units from each of a large number of strata. This 
involves a solution of a series of quadratic programming 
problems, each of a reasonable size, provided the set of non- 
preferred samples is specified separately in each stratum. 
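
The size limitation comes from the number of possible samples, $\binom{N}{n}$, since each sample is one variable in the quadratic programme. A minimal sketch of the count, with a hypothetical helper name, is:

```python
from math import comb


def number_of_samples(N, n):
    """Number of distinct samples of size n from N units, i.e. QP variables."""
    return comb(N, n)


sizes = {
    (6, 3): number_of_samples(6, 3),      # Example 1
    (7, 4): number_of_samples(7, 4),      # Example 5
    (12, 5): number_of_samples(12, 5),    # largest case reported as manageable
    (30, 5): number_of_samples(30, 5),    # a moderately larger stratum
}
```

The counts are 20, 35, 792 and 142506 respectively, which shows how quickly full enumeration grows once $N$ moves beyond a dozen units.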

As in the case of linear programming, there is no 
guarantee of convergence of a quadratic programming 
problem. Kuhn and Tucker (1951) have derived some 
necessary conditions for the optimum solution of a quadratic 
programming algorithm but no sufficient conditions exist for 
convergence. Therefore unless the Kuhn-Tucker conditions 
are satisfied in advance, there is no way of verifying whether 
a quadratic programming algorithm converges to an absolute 
(global) or relative (local) optimum. Also, there is no way to
predict in advance whether a solution of a quadratic
programming problem exists.


2.3 Comparison of sampling variance of the estimate 


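To estimate the population mean $\bar{Y}$ ($= N^{-1} \sum_{i=1}^{N} y_i$) based
on a sample $s$ of size $n$, we use the HT estimator of $\bar{Y}$
defined as

$$\hat{\bar{Y}}_{HT} = \sum_{i \in s} \frac{y_i}{N \pi_i}. \qquad (9)$$

Sen (1953) and Yates and Grundy (1953) showed
independently that for fixed size sampling designs, $\hat{\bar{Y}}_{HT}$ has
the variance

$$V(\hat{\bar{Y}}_{HT}) = \frac{1}{N^2} \sum_{i<j=1}^{N} (\pi_i \pi_j - \pi_{ij}) \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2, \qquad (10)$$

and an unbiased estimator of $V(\hat{\bar{Y}}_{HT})$ is given as

$$\hat{V}(\hat{\bar{Y}}_{HT}) = \frac{1}{N^2} \sum_{i<j=1}^{n} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2. \qquad (11)$$
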
Constraint (v), when used in the proposed plan, ensures the 
non-negativity of the variance estimator (11). 
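
A small numerical sketch may help fix ideas. It computes the true variance in the Sen-Yates-Grundy form (10) from an enumerated design and the Yates-Grundy estimate (11) for one realised sample, reading (10) as $N^{-2}\sum_{i<j}(\pi_i\pi_j - \pi_{ij})(y_i/\pi_i - y_j/\pi_j)^2$. The function names and the toy population are assumptions for illustration.

```python
from itertools import combinations


def inclusion_probabilities(design, units):
    """First and second order inclusion probabilities of an enumerated design."""
    pi = {i: sum(q for s, q in design.items() if i in s) for i in units}
    pij = {
        (i, j): sum(q for s, q in design.items() if i in s and j in s)
        for i, j in combinations(units, 2)
    }
    return pi, pij


def true_variance(design, y):
    """Equation (10); y maps each unit to its value."""
    units = sorted(y)
    pi, pij = inclusion_probabilities(design, units)
    total = sum(
        (pi[i] * pi[j] - pij[i, j]) * (y[i] / pi[i] - y[j] / pi[j]) ** 2
        for i, j in combinations(units, 2)
    )
    return total / len(units) ** 2


def yates_grundy_estimate(sample, design, y):
    """Equation (11) for one realised sample (requires pi_ij > 0)."""
    units = sorted(y)
    pi, pij = inclusion_probabilities(design, units)
    total = sum(
        (pi[i] * pi[j] - pij[i, j]) / pij[i, j]
        * (y[i] / pi[i] - y[j] / pi[j]) ** 2
        for i, j in combinations(sorted(sample), 2)
    )
    return total / len(units) ** 2


# Toy check: simple random sampling of n = 2 from N = 4.
y = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
design = {s: 1 / 6 for s in combinations(range(1, 5), 2)}
v_true = true_variance(design, y)
v_expected = sum(q * yates_grundy_estimate(s, design, y)
                 for s, q in design.items())
```

For this equal-probability case `v_true` equals the textbook value $(1 - n/N)S^2/n = 5/12$, and averaging the estimate (11) over all samples reproduces it, which illustrates the unbiasedness that constraint (iv) is meant to protect.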

To demonstrate the utility of the proposed procedure, we 
use the empirical examples given in Section 3 to compare 
the true sampling variance of the HT estimator for the 
proposed procedure obtained through (10) with variances of 
the HT estimator using the optimal controlled plan of Rao 
and Nigam (1990, 1992) and those of two uncontrolled high
entropy (meaning the absence of any detectable pattern or
ordering in the selected sample units) procedures of
Goodman and Kish (1950) and Brewer and Donadio (2003).
In what follows, we reproduce the expressions for the
variances of these two high entropy procedures.

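The expression for the variance of $\hat{\bar{Y}}_{HT}$ correct to
$O(N^{-2})$ using the procedure of Goodman and Kish (1950)
is given as

$$V(\hat{\bar{Y}}_{HT})_{GK} = \frac{1}{nN^2} \left[ \sum_{i \in U} p_i A_i^2 - (n-1) \sum_{i \in U} p_i^2 A_i^2 \right] - \frac{n-1}{nN^2} \left[ 2 \sum_{i \in U} p_i^3 A_i^2 - \sum_{i \in U} p_i^2 \sum_{i \in U} p_i^2 A_i^2 - 2 \left( \sum_{i \in U} p_i^2 A_i \right)^2 \right], \qquad (12)$$

where $A_i = Y_i / p_i - Y$, and $U$ denotes the finite
population of $N$ units.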

Recently, Brewer and Donadio (2003) derived the $\pi_{ij}$-free
formula for the high entropy variance of the HT
estimator. They showed that the performance of this
variance estimator, under conditions of high entropy, was
reasonably good for all populations. Their expression for the
variance of the HT estimator is given by

$$V(\hat{\bar{Y}}_{HT})_{BD} = \frac{1}{N^2} \sum_{i \in U} \pi_i (1 - c_i \pi_i) \left( \frac{Y_i}{\pi_i} - \frac{Y}{n} \right)^2, \qquad (13)$$

where $c_i = (n-1) / \{ n - (2n-1)(n-1)^{-1} \pi_i + (n-1)^{-1} \sum_{k \in U} \pi_k^2 \}$ for all $i \in U$, which appears to perform
better than the other values of $c_i$ suggested by them.

3. Examples 


In this section, we consider some empirical examples to 
demonstrate the utility of the proposed procedure and 
compare it with the existing procedures of optimal 
controlled sampling. In the present discussion, we begin 
with the Midzuno-Sen (1952, 1953) IPPS design to 
demonstrate our procedure, as it is relatively easy to 
compute the probability of drawing every potential sample 
under this scheme. However, if the conditions of the 
Midzuno-Sen scheme are not satisfied, we demonstrate that 
other IPPS sampling without replacement procedures, such 
as the Sampford (1967) procedure, may also be used to 
obtain the initial IPPS design p(s). The true sampling 
variance of the HT estimator under the proposed plan is also 
compared with that of the existing procedures of optimal 
controlled selection and uncontrolled high entropy selection 
procedures given by (12) and (13). 

Example 1: Let us consider a population consisting of six
villages, borrowed from Hedayat and Lin (1980). The set $S$
of all possible samples consists of 20 samples each of size
$n = 3$. Due to considerations of travel, organization of
fieldwork and cost, Rao and Nigam (1990)
identified the following 7 samples as non-preferred samples:


123; 126; 136; 146; 234; 236; 246


(a). The $Y_i$ and $p_i$ values associated with the six villages of
the population are:
$Y_i$: 12 15 17 24 17 12
$p_i$: 0.14 0.14 0.15 0.16 0.22 0.19


Since the $p_i$ values satisfy the condition (1), we apply the
MS scheme (3) to get an IPPS plan with the revised normal
size measures ($p_i^*$’s) given by (2).

Applying the method discussed in Section 2 and solving 
the resulting quadratic programming problem with the 
Microsoft Excel Solver of Microsoft Office 2000 package, 
we obtain the controlled IPPS plan given in Table 1. 


Table 1 Optimal controlled IPPS plan corresponding to
Midzuno-Sen (MS) and Sampford’s (SAMP)
schemes for Example 1


s     p1(s) [MS]    p1(s) [SAMP]    s     p1(s) [MS]    p1(s) [SAMP]
124   0.14          0.09            245   0.03          0.12
125   0.08          0.05            256   0.13          0.14
134   0.00          0.00            345   0.02          0.06
135   0.09          0.03            346   0.20          0.10
145   0.03          0.06            356   0.06          0.06
156   [illegible]   0.07            456   0.06          0.16
235   [illegible]   0.05


This plan matches the original $\pi_i$ values, satisfies the
condition $\pi_{ij} \le \pi_i\pi_j$ and ensures that the probability of
selecting non-preferred samples is exactly equal to zero.
Because the condition $\pi_{ij} \le \pi_i\pi_j$ is fulfilled,
we can apply the Yates-Grundy form of the HT
variance estimator for estimating the variance of the
proposed plan.

We have also solved the above example using plan (3)
of Rao and Nigam (1990, page 809) with specified $\pi_{ij}$’s
taken from Sampford’s plan [to be denoted by RN3] and
their plan (4) [to be denoted by RN4]. Using the RN3 plan,
the probability of non-preferred samples ($\phi$) comes out to
be 0.155253 and using the RN4 plan with $c = 0.005$, $\phi$
comes out to be zero, whereas the proposed plan always
ensures zero probability to non-preferred samples.
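
For any enumerated design, the probability of non-preferred samples quoted above is simply the total probability placed on the set of non-preferred samples. The sketch below shows the calculation with a hypothetical helper, using an equal-probability design over the 20 samples of Example 1 as a stand-in for an uncontrolled plan (the RN3 and RN4 plans themselves are not reproduced here).

```python
from itertools import combinations


def non_preferred_probability(design, non_preferred):
    """Total probability a design assigns to the non-preferred samples."""
    return sum(design.get(s, 0.0) for s in non_preferred)


non_preferred = [(1, 2, 3), (1, 2, 6), (1, 3, 6), (1, 4, 6),
                 (2, 3, 4), (2, 3, 6), (2, 4, 6)]

# Stand-in uncontrolled design: equal probability on all 20 samples.
equal_design = {s: 1 / 20 for s in combinations(range(1, 7), 3)}
phi_equal = non_preferred_probability(equal_design, non_preferred)

# Any design supported only on preferred samples gives zero.
controlled = {s: q for s, q in equal_design.items() if s not in non_preferred}
phi_controlled = non_preferred_probability(controlled, non_preferred)
```

With equal probabilities the seven non-preferred samples carry 7/20 = 0.35 of the total, whereas a design restricted to the preferred samples returns exactly zero.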

The values of the true sampling variance of the HT
estimator [$V(\hat{\bar{Y}}_{HT})$] for the proposed plan, the RN3 plan, the
RN4 plan, the Randomized Systematic IPPS sampling plan
of Goodman and Kish (1950) [to be denoted by GK] and the
uncontrolled high entropy sampling plan of Brewer and
Donadio (2003) [to be denoted by BD] are produced in the
first row of Table 2. It is clear from Table 2 that the
proposed plan yields almost the same value of variance of
the HT estimator as yielded by the RN4 plan. The value of
$V(\hat{\bar{Y}}_{HT})$ for the proposed plan is slightly higher than those
obtained from the RN3, GK and BD plans. This increase in
variance may be acceptable given the elimination of
undesirable samples by the proposed plan.


Table 2 Values of the true sampling variance of the HT
estimator [$V(\hat{\bar{Y}}_{HT})$] for the Proposed, RN3, RN4, GK
and BD plans


Example               RN3           RN4           GK            BD            Proposed plan
Ex 1(a), N=6, n=3     [illegible]   [illegible]   [illegible]   [illegible]   4.06
Ex 1(b), N=6, n=3     4.37          5.07          4.89          [illegible]   4.78
Ex 2(a), N=7, n=3     4.48          5.01          4.61          4.45          3.56
Ex 2(b), N=7, n=3     [illegible]   [illegible]   [illegible]   [illegible]   9.49
Ex 3(a), N=8, n=3     4.85          4.29          4.96          4.86          3.90
Ex 3(b), N=8, n=3     [illegible]   [illegible]   [illegible]   [illegible]   8.17
Ex 4(a), N=8, n=4     [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
Ex 4(b), N=8, n=4     [illegible]   [illegible]   [illegible]   2.40          2.25
Ex 5, N=7, n=4        3.08          [illegible]   [illegible]   3.07          5.10


(b). Now suppose that the $p_i$ values for the above population
of 6 units are as follows:


$p_i$: 0.10 0.15 0.10 0.20 0.27 0.18


Since these values of $p_i$ do not satisfy the condition (1) of
the MS plan, we apply the Sampford (1967) plan to get the
initial IPPS design $p(s)$ using (4).

Applying the method discussed in Section 2 and solving 
the resultant quadratic programming problem, we obtain the 
controlled IPPS plan given in Table 1. This plan again 
ensures zero probability to non-preferred samples and 
satisfies the non-negativity condition for the Yates-Grundy 
form of the HT variance estimator. This example was also
solved by the RN3 and RN4 plans. The value of $\phi$ for the
RN3 plan is 0.064135 and the value of $\phi$ for the RN4 plan
with $c = 0.005$ is zero. The proposed plan always ensures
zero probability to non-preferred samples.

The values of $V(\hat{\bar{Y}}_{HT})$ for the proposed plan, the RN3
plan, the RN4 plan, the GK plan and the BD plan are
produced in the second row of Table 2. The proposed plan
appears to perform better than the RN4 and GK plans and
quite close to other plans considered by us.

Further examples were constructed to analyze the 
performance of the proposed plan. The populations with $Y_i$
and $p_i$ values and the set of non-preferred samples for each
population are summarized in the Appendix. The $p_i$ values
for Examples 2(a), 3(a) and 4(a) satisfy the condition (1) of
the Midzuno-Sen plan and hence for these examples the
Midzuno-Sen IPPS plan is used to obtain the initial IPPS
design $p(s)$. However, for Examples 2(b), 3(b) and 4(b)
the $p_i$ values do not satisfy this condition and therefore we
apply the Sampford IPPS plan to obtain the initial IPPS
design. The probabilities of non-preferred samples ($\phi$) for
these examples using the RN3 plan, the RN4 plan and the 
proposed method are produced in Table 3. Table 3 shows 
that while the RN3 and RN4 plans only attempt to minimize 
the probability of non-preferred samples, the proposed plan 
always ensures zero probability to non-preferred samples. 

The values of $V(\hat{\bar{Y}}_{HT})$ for the proposed plan, the RN3
plan, the RN4 plan, the GK plan and the BD plan for the 
population summarized in the Appendix are given in Table 
2. From Table 2 we conclude that for all the empirical 
problems considered by us, the proposed plan appears to 
perform better than or quite close to the RN3, RN4, GK and 
BD plans. The increase in variance of the estimate for the 
proposed plan in some cases may be acceptable given the 
elimination of undesirable samples by the proposed plan. 


Table 3 The probabilities of non-preferred samples ($\phi$) using
RN3, RN4 and Proposed plans


Example                       RN3 plan   RN4 plan             Proposed plan
Example 2(a), N=7, n=3        0.06       0 (c = 0.5)          0
Example 2(b), N=7, n=3        0.05       0 (c = 0.5)          0
Example 3(a), N=8, n=3        0.12       0 (c = 0.005)        0
Example 3(b), N=8, n=3        0.17       0 (c = 0.005)        0
Example 4(a), N=8, n=4        0.05       0 (c = 0.005)        0
Example 4(b), N=8, n=4        0.13       0 (c = 0.005)        0
Example 5, N=7, n=4           0.30       0.1008 (c = 0.5)     0


Example 5: We now consider one more example to
demonstrate the situation where the proposed plan fails to
provide a feasible solution satisfying all the constraints in
(7). In such situations, we have to drop a constraint in (7) to
obtain a feasible solution of the related quadratic programming
problem.

Consider a population of seven villages. Suppose a
sample of size $n = 4$ is to be drawn from this population.
There are 35 possible samples, out of which the following
14 are considered as non-preferred:


1234; 1236; 1246; 1346; 1456; 1567; [illegible];
2345; 2346; 2456; 2567; 3456; 3567; 4567.


Suppose that the following $p_i$ values are associated with
the seven villages:


Pee aS tS, O15 = OS OO U. Ls male. 


Since the $p_i$ values satisfy condition (1), we apply the
MS plan (3) to obtain the initial IPPS design $p(s)$ and
solve the quadratic programming problem by the method
discussed in Section 2. However, no feasible solution of the
related quadratic programming problem exists in this case.
Consequently, we drop constraint (v) in (7) for this
particular problem to obtain a feasible solution of the
quadratic programming problem. The probabilities of non-preferred
samples using the RN3 plan, the RN4 plan and the
proposed plan for this empirical problem are given in the
last row of Table 3. The proposed plan again matches the
original $\pi_i$ values and ensures that the probability of selecting
the non-preferred samples is exactly equal to zero. However,
because the condition $\pi_{ij} \le \pi_i\pi_j$ is not fulfilled for this
example, the non-negativity of the Yates-Grundy estimator
of the variance is not ensured. The values of the true
variance, $V(\hat{\bar{Y}}_{HT})$, for the proposed plan, the RN3 plan, the
RN4 plan, the GK plan and the BD plan are produced in the
last row of Table 2. The value of $V(\hat{\bar{Y}}_{HT})$ for this empirical
example using the proposed plan does not appear to be
satisfactory. For such problems where constraint (v) is not
satisfied, we suggest the use of alternative variance
estimators in place of the Yates-Grundy variance estimator.

We have also solved one more example with N =9 and 
n=4 using both the Midzuno-Sen and Sampford’s 
methods for obtaining the initial IPPS design p(s). The 
details of these solutions are omitted for brevity and can be 
obtained from the authors. 


4. Conclusion 


We have proposed a quadratic programming approach to 
solve the controlled sampling problems ensuring zero 
probability to non-preferred samples. The concept of 
‘nearest proportional to size sampling designs’ of Gabler 
(1987) is used to obtain the proposed plan. The approach is 
simple in concept and is very flexible in allowing for a 
range of different objective functions as well as in 
permitting a variety of constraints. The only limitation of the 
procedure is that it cannot be applied to large populations, as 
the computational process becomes quite tedious for large
populations. The utility of the proposed procedure is
demonstrated with the help of examples and its true 
sampling variance is empirically compared with that of 
existing controlled sampling plans and uncontrolled high 
entropy sampling procedures. The proposed plan performs 
suitably.


Appendix


The populations for Examples 2-4 with $Y_i$ and $p_i$
values and the set of non-preferred samples.


Example 2. $N = 7$, $n = 3$.

Non-preferred samples: 123; 126; 136; 146; 234; 236; 246;
137; 147; 167; 237; 247; 347; 467.


$Y_i$: 12 15 17 24 17 19 25
(a)<p;: lovOl2eT Od2jehO BoiqnOHl4 2550.20 KOASio O14 
(bp; 0.08) (0.08 © SOOM erOrt O24" 20.207 "'013 


Example 3. $N = 8$, $n = 3$.

Non-preferred samples: 123; 126; 136; 146; 234; 236; 246;
137; 147; 167; 237; 247; 347; 467;
128; 178; 248; 458; 468; 478; 578.


$Y_i$: 12 15 17 24 17 19 25 18
Gp OV0n COLOR OLLS O12 0 1380135) 0.125.014 
(Oy OU UUs O20) OLS O1G. UNL ou i2s 2 U8 


Example 4. $N = 8$, $n = 4$.

Non-preferred samples: 1234; 1236; 1238; 1246; 1248; 1268; 1346;
1348; 1357; 1456; 1468; 1567; 1568; 1678;
2345; 2346; 2456; 2468; 2567; 2568; 2678;
3456; 3468; 3567; 3678; 4567; 4678; 5678.


$Y_i$: 12 15 17 24 17 19 25 18
(pipe ee OES OZ ONS Se Ot O12 > OIL -0.13 
(Dp 0.095 O09: OTS” ON O12 7) O14: S017. 7010 


GUIDELINES FOR MANUSCRIPTS 


Before having a manuscript typed for submission, please examine a recent issue of Survey Methodology (Vol. 32, No. 2 and 
onward) as a guide and note particularly the points below. Articles must be submitted in machine-readable form, preferably 
in Word. A paper copy may be required for formulas and figures. 


Layout 


Manuscripts should be typed on white bond paper of standard size (8½ × 11 inch), one side only, entirely double
spaced with margins of at least 1½ inches on all sides.

The manuscripts should be divided into numbered sections with suitable verbal titles. 

The name and address of each author should be given as a footnote on the first page of the manuscript. 
Acknowledgements should appear at the end of the text. 

Any appendix should be placed after the acknowledgements but before the list of references. 


Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid 
mathematical expressions in the abstract. 



Style 


Avoid footnotes, abbreviations, and acronyms. 

Mathematical symbols will be italicized unless specified otherwise, except for functional symbols such as “exp(·)”
and “log(·)”, etc.

Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important 
equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are 
to be referred to later. 

Write fractions in the text using a solidus. 

Distinguish between ambiguous characters (e.g., w, ω; o, O, 0; l, 1).

Italics are used for emphasis. Indicate italics by underlining on the manuscript. 


Figures and Tables 


All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly self 
explanatory as possible, at the bottom for figures and at the top for tables. 

They should be put on separate pages with an indication of their appropriate placement in the text. (Normally they 
should appear near where they are first referred to). 


References 


References in the text should be cited with authors’ names and the date of publication. If part of a reference is cited, 
indicate after the reference, e.g., Cochran (1977, page 164). 

The list of references at the end of the manuscript should be arranged alphabetically and for the same author 
chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of 
publication. Journal titles should not be abbreviated. Follow the same format used in recent issues. 


Short Notes 


Documents submitted for the short notes section must have a maximum of 3,000 words.


In This Issue


This issue of Survey Methodology opens with the seventh paper in the annual invited paper series in
honour of Joseph Waksberg. The editorial board would like to thank the members of the selection
committee (Gordon Brackstone, chair, Bob Groves, Sharon Lohr and Wayne Fuller) for having selected
Carl-Erik Särndal as the author of this year’s Waksberg Award paper. For this occasion, a special
Workshop on Calibration and Estimation in Surveys (WCES) was organised on October 31 and
November 1 at Statistics Canada. Professor Carl-Erik Särndal was the keynote speaker and presented his
Waksberg paper. During the two days, 12 other speakers presented papers and paid tribute to Carl-Erik
Särndal.

In his paper entitled “The Calibration Approach in Survey Theory and Practice”, Särndal discusses the
development and application of calibration in survey sampling. He describes the concept of calibration in 
some detail and contrasts it with generalized regression. He then describes different approaches to 
calibration including the minimum distance method, instrumental variables, and model calibration. Several 
examples of calibration and alternatives are considered. 

Laaksonen discusses weighting in two phase sampling in which respondents to the first phase are asked 
if they are willing to participate in the second phase. The weighting thus has to deal with non-response at 
both phases of the survey, and also account for first-phase respondents who were unwilling to participate in 
the second phase. Using data from a Finnish survey on leisure-time activities, he empirically evaluates 
variations on a weighting method that uses response propensity modeling and calibration. 

The article by Ardilly and Lavallée discusses the weighting problem for the SILC (Statistics on Income
and Living Conditions) survey in France. This survey uses a rotating sample plan with nine panels. To
obtain approximately unbiased estimators, the authors rely on the weight-share method. Longitudinal
weighting is discussed first, followed by cross-sectional weighting.

The paper by Kim, Li and Valliant deals with the problem of small cells or large weight adjustments 
when poststratification is used. The authors first describe several standard estimators and then introduce 
two alternative estimators based on cell collapsing. They study the performance of these estimators in terms 
of their effectiveness in controlling the coverage bias and the design variance. These properties are 
evaluated theoretically and also through a simulation study using a population based on the 2003 National 
Health Interview Survey. 

Mecatti proposes a simple multiplicity estimator in the context of multi-frame surveys. She first shows 
that the proposed estimator is design-unbiased. Then, she proposes an unbiased estimator of the variance of 
the multiplicity estimator. Using 29 simulated populations, she compares the multiplicity estimator with 
alternative estimators proposed in the literature. 

Haziza studies the problem of variance estimation for a ratio of two totals when marginal random hot 
deck imputation has been used to fill in missing data. Two approaches to inference are considered, one 
using an imputation model and a second one using a nonresponse model. Variance estimators are derived 
under two frameworks: the reverse approach of Shao and Steel (1999) and the traditional two-phase 
approach. 

In their paper, Chipperfield and Preston describe the without replacement scaled bootstrap variance 
estimator that was implemented in the Australian Bureau of Statistics’ generalized estimation system 
ABSEST. The without replacement scaled bootstrap estimator is shown to be more efficient than the with 
replacement scaled bootstrap estimator for stratified samples when the stratum sizes are small. In addition, 
the without replacement scaled bootstrap estimator was shown to require fewer replicates to achieve the 
same replication error as the with replacement estimator. For the ABSEST system, bootstrap variance 
estimators were chosen over other variance estimation methods for their computational efficiency and the 
without replacement bootstrap was selected for the reasons above.


Oleson, He and Sun describe a Bayesian modelling approach for situations where the sampling design is 
stratified and the estimation procedure requires post-stratification. The method is illustrated with data from 
the 1998 Missouri Turkey Hunting Survey for which the strata were defined by the hunter’s place of 
residence but estimates were required at the county level. 

Fabrizi, Ferrante and Pacei discuss a methodology which is increasingly important in modern sample 
survey applications. They investigate the effect of borrowing strength from additional panel information for 
cross sectional household income estimates for small areas in Italy. The proposed methods seem to tackle a 
problem which may have further relevance for European Official Statistics, and possibly also in the area of 
small area statistics for indicators which may be used for policy research. 

Renaud presents an interesting application of a post-enumeration survey to estimate net undercoverage 
in the 2000 census in Switzerland. The objective of this survey was slightly different from that of other 
countries in that it was not designed to adjust the Census counts for net undercoverage, but rather to gather 
information to improve the quality of subsequent censuses. 

In the final paper, Elliott and Haviland consider combining a convenience sample with a probability-based
sample to obtain an estimate with a smaller MSE. The resulting estimator is a linear combination of
the convenience and probability sample estimates with weights that are a function of the bias. By looking at 
the maximum incremental contribution of the convenience sample, they show that improvement to the 
MSE may be attainable only in certain circumstances. 


Harold Mantel, Deputy Editor


Waksberg Invited Paper Series


The journal Survey Methodology has established an annual invited paper series in honour of Joseph 
Waksberg, who has made many important contributions to survey methodology. Each year a prominent 
survey researcher is chosen to author an article as part of the Waksberg Invited Paper Series. The paper 
reviews the development and current state of a significant topic within the field of survey methodology, and 
reflects the mixture of theory and practice that characterized Waksberg’s work. The author receives a cash 
award made possible by a grant from Westat, in recognition of Joe Waksberg’s contributions during his 
many years of association with Westat. The grant is administered financially by the American Statistical 
Association. Previous winners are listed below. Their papers in the series have already appeared in Survey 
Methodology. 


Previous Waksberg Award Winners: 


Gad Nathan (2001) 
Wayne A. Fuller (2002) 
Tim Holt (2003) 

Norman Bradburn (2004) 
J.N.K. Rao (2005) 
Alastair Scott (2006) 
Carl-Erik Särndal (2007)


Nominations: 


The author of the 2009 Waksberg paper will be selected by a four-person committee appointed by Survey 
Methodology and the American Statistical Association. Nominations of individuals to be considered as 
authors or suggestions for topics should be sent to the chair of the committee, Robert Groves, by email to 
bgroves@isr.umich.edu. Nominations and suggestions for topics must be received by February 29, 2008. 


2007 Waksberg Invited Paper
Author: Carl-Erik Särndal


Carl-Erik Särndal, retired professor at the Université de Montréal, is a consultant and expert who has
been associated with several national statistical institutes, in particular Statistics Canada and Statistics
Sweden as well as Statistics Finland, INSEE and Eurostat. His list of publications comprises three
books, including the renowned Model Assisted Survey Sampling, which has had a major
impact. He is also the author of numerous scientific articles, in sole authorship or in collaboration with
researchers from many countries. His research interests in survey sampling have been very diverse,
but have most often revolved around ways to make the best use of auxiliary information in sampling and estimation.


Members of the Waksberg Paper Selection Committee (2007-2008)


Robert Groves (Chair)
Wayne A. Fuller, Iowa State University
Daniel Kasprzyk, Mathematica Policy Research
Leyla Mohadjer, Westat


Past Chairs: 


Graham Kalton (1999 - 2001) 
Chris Skinner (2001 - 2002) 
David A. Binder (2002 - 2003) 

J. Michael Brick (2003 - 2004) 
David R. Bellhouse (2004 - 2005) 
Gordon Brackstone (2005 - 2006) 
Sharon Lohr (2006 - 2007)


The calibration approach in survey theory and practice


Carl-Erik Särndal ¹


Abstract 


Calibration is the principal theme in many recent articles on estimation in survey sampling. Words such as “calibration 
approach” and “calibration estimators” are frequently used. As article authors like to point out, calibration provides a 
systematic way to incorporate auxiliary information in the procedure. 


Calibration has established itself as an important methodological instrument in large-scale production of statistics. Several 
national statistical agencies have developed software designed to compute weights, usually calibrated to auxiliary 
information available in administrative registers and other accurate sources. 


This paper presents a review of the calibration approach, with an emphasis on progress achieved in the past decade or so. 
The literature on calibration is growing rapidly; selected issues are discussed in this paper. 


The paper starts with a definition of the calibration approach. Its important features are reviewed. The calibration approach 
is contrasted with (generalized) regression estimation, which is an alternative but conceptually different way to take 
auxiliary information into account. The computational aspects of calibration are discussed, including methods for avoiding 
extreme weights. In the early sections of the paper, simple applications of calibration are examined: the estimation of a
population total in direct, single phase sampling. Generalizations to more complex parameters and more complex sampling
designs are then considered. A common feature of more complex designs (sampling in two or more phases or stages) is that
the available auxiliary information may consist of several components or layers. The uses of calibration in such cases of 
composite information are reviewed. Later in the paper, examples are given to illustrate how the results of the calibration 
thinking may contrast with answers given by earlier established approaches. Finally, applications of calibration in the 
presence of nonsampling error are discussed, in particular methods for nonresponse bias adjustment. 


Key Words: Auxiliary information; Weighting; Consistency; Design-based inference; Regression estimator; Models; 
Nonresponse; Complex sampling design. 


1. Introduction


1.1 Calibration defined


It is useful in this paper to refer to a definition of the
calibration approach. I propose the following formulation.


Definition. The calibration approach to estimation for finite
populations consists of


(a) a computation of weights that incorporate specified
auxiliary information and are restrained by
calibration equation(s),


(b) the use of these weights to compute linearly
weighted estimates of totals and other finite
population parameters: weight times variable value,
summed over a set of observed units,


(c) an objective to obtain nearly design unbiased
estimates as long as nonresponse and other
non-sampling errors are absent.


In the literature, “calibration” frequently refers to (a)
alone; I shall often use the term for (a) to (c) together.
Earlier definitions, although less extensive, agree essentially
with mine. Ardilly (2006) defines calibration (or, more
precisely, “calage généralisé”) as a method of re-weighting
used when one has access to several variables, qualitative or
quantitative, on which one wishes to carry out, jointly, an
adjustment.

Kott (2006) defines calibration weights as a set of
weights, for units in the sample, that satisfy a calibration to
known population totals, and such that the resulting
estimator is randomization consistent (design consistent), or,
more rigorously, that the design bias is, under mild
conditions, an asymptotically insignificant contribution to
the estimator’s mean squared error. This is the property I
call “nearly design unbiased”.

The Quality Guidelines (fourth edition) of Statistics
Canada (2003) say: “Calibration is a procedure that can be
used to incorporate auxiliary data. This procedure adjusts
the sampling weights by multipliers known as calibration
factors that make the estimates agree with known totals. The
resulting weights are called calibration weights or final
estimation weights. These calibration weights will generally
result in estimates that are design consistent, and that have a
smaller variance than the Horvitz-Thompson estimator.”

Part (c) of the definition merits a comment. Nothing
prevents producing weights calibrated to given auxiliary
information without requiring (c). But most published work
on calibration is in the spirit of (c), so it makes good sense to
include it. When non-sampling errors are present, bias in the
estimates is unavoidable, whether they are made by
calibration or by any other method. In line with (c), I
consider design-based inference to be the standard in this
paper. The randomization-based variance of an estimator is
thus important. However, the paper focuses on “motivations
behind (point) estimation”; for reasons of space, the
important question of variance estimation is not addressed.


1.2 Comments arising 


The definition in Section 1.1 prompts some comments 
and references to earlier literature: 


(1) Calibration as a linear weighting method. Calibration 
has an intimate link to practice. The fixation on weighting 
methods on the part of the leading national statistical 
agencies is a powerful driving force behind calibration. To 
assign an appropriate weight to an observed variable value, 
and to sum the weighted variable values to form appropriate 
aggregates, is a firmly rooted procedure. It is used in
statistical agencies for estimating various descriptive finite 
population parameters: totals, means, and functions of totals. 
Weighting is easy to explain to users and other stakeholders 
of the statistical agencies. 

Weighting of units by the inverse of their inclusion 
probability found firm scientific backing long ago in papers 
such as Hansen and Hurwitz (1943) and Horvitz and
Thompson (1952). Weighting became widely accepted.
Later, post-stratification weighting achieved the same status. 
Calibration weighting extends both of these ideas. 
Calibration weighting is outcome dependent; the weights 
depend on the observed sample. 

Inverse inclusion probability weights are, by definition, 
greater than or equal to unity. A commonly heard 
interpretation is that “an observed unit represents itself and a 
number of others, not observed”. Calibrated weights, on the 
other hand, are not necessarily greater than or equal to unity, 
unless special care is taken in the computation to obtain this 
property. 

Calibration is new as a term in survey sampling (about
15 years old) but not as a technique for producing weights.
Those who maintain “I practiced calibration long before it 
was called calibration” have a point. The last 15 years 
widened the scope and the appeal of the technique. 
Weighting akin to calibration has long been used by private 
survey institutes, for example, in connection with quota 
sampling, a form of non-probability sampling outside the 
scope of this paper. 

Weighting of observed variable values was an important 
topic before calibration became a popular term. Some 
authors derived the weights via the argument that they 
should differ as little as possible from the unbiased sampling 
design weights (the inverse of the inclusion probabilities). 
Others found the weights by recognizing that a linear 
regression estimator can be written as a linearly weighted 
sum of the observed study variable values. Terms such as
“survey sample weighting”, “regression weighting” and
“case weighting” are used. Among such “early papers” are
Alexander (1987), Bankier, Rathwell and Majkowski
(1992), Bethlehem and Keller (1987), Chambers (1996),
Fuller, Loughin and Baker (1994), Kalton and Flores-Cervantes
(1998), Lemaître and Dufour (1987), Särndal
(1982) and Zieschang (1990). I comment later on the
technique “repeated weighting”, promoted by the Dutch
national statistical agency, CBS. The newer term
“calibration” conveys a more specific message and a more
definite direction than the older “weighting”.


(2) Calibration as a systematic way to use auxiliary 
information. Calibration provides a systematic way to take 
auxiliary information into account. As Rueda, Martinez, 
Martinez and Arcos (2007) point out, “in many standard 
settings, the calibration provides a simple and practical
approach to incorporating auxiliary information into the 
estimation”. 

Auxiliary information was used to improve the accuracy 
of survey estimates long before calibration became popular. 
Numerous papers were written with this goal in mind, for 
more or less specialized situations. Today, calibration does 
offer a systematic outlook on the uses of auxiliary 
information. For example, calibration can deal effectively 
with surveys where auxiliary information exists at different 
levels. In two-stage sampling, information may exist for the
first stage sampling units (the clusters), and other
information for the second stage sampling units. In surveys
with nonresponse (that is, essentially all surveys), information
may exist “at the population level” (known
population totals), and other information “at the sample
level” (auxiliary variable values for all those sampled,
responding and non-responding). Calibration with
“composite information” is reviewed in Sections 8 and 9.

Regression estimation, or generalized regression (GREG) 
estimation, competes with calibration as a systematic way to
incorporate auxiliary information. It is therefore important
to contrast GREG estimation (described in Section 3) with 
calibration estimation (described in Section 4). The two 
approaches are different. 


(3) Calibration to achieve consistency. Calibration is often 
described as “a way to get consistent estimates”. (Here 
“consistent” refers not to “randomization consistent” but to 
“consistent with known aggregates”.) The calibration 
equations impose consistency on the weight system, so that, 
when applied to the auxiliary variables, it will confirm (be 
consistent with) known aggregates for those same auxiliary 
variables. A desire to promote credibility in published 
statistics is an often cited reason for demanding consistency. 
Some users of statistics dislike finding the same population
quantity estimated by two or more numbers that do not
agree.

The totals with which consistency is sought are 
sometimes called control totals. “Controlled weights” or 
“calibrated weights” suggest improved, more accurate 
estimation. The French term for calibration, “calage”, has a 
similar connotation of “stability”. 

Consistency through calibration has a broader implication
than just agreement with known population auxiliary
totals. Consistency can, for example, be sought with 
appropriately estimated totals, arising in the current survey 
or in other surveys. 

Consistency among tables estimated from different 
surveys is the motive behind repeated weighting, the 
technique developed at the Dutch national statistical agency 
CBS in several articles: Renssen and Nieuwenbroek (1997); 
Nieuwenbroek, Renssen, and Hofman (2000); Renssen, 
Kroese, and Willeboordse (2001); Knottnerus and van Duin 
(2006). The stated objective is to accommodate user 
demands to produce numerically consistent outputs. As the 
last mentioned paper points out, repeated weighting can be 
seen as an additional calibration step for a new adjustment 
of already calibrated weights. The final weights realize 
consistency with given margins. 

Consistency with known or estimated totals may bring 
the extra benefit of improved accuracy (lower variance 
and/or reduced nonresponse bias). However, in some 
articles, especially those authored in statistical agencies, 
consistency for user satisfaction seems a more imperative 
motivation than the prospect of increased accuracy. 

When the primary motivation for calibration is less to
achieve agreement with other statistics than to reduce
variance and/or nonresponse bias, then “balanced weight
system” is a more appropriate description than “consistent
weight system”, because the objective is then to balance the
weights to reflect the outcome of the sampling, the response
to the survey, and the information available.
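
The consistency property described in comment (3) can be checked numerically. The following Python fragment is only a sketch under stated assumptions: it uses the linear (chi-square distance) form of calibration with the auxiliary vector (1, x_k)', which is one common choice and not necessarily the form developed later in the paper. The function name and the data are hypothetical. The weights are w_k = d_k(1 + λ0 + λ1 x_k), with (λ0, λ1) solving the two calibration equations.

```python
def linear_calibration_weights(x_s, d_s, pop_size, x_total):
    """Return weights w_k = d_k * (1 + lam0 + lam1 * x_k) that reproduce
    the known population size and the known population total of x."""
    # Elements of the 2 x 2 matrix sum_s d_k x_k x_k' for x_k = (1, x_k)'
    t00 = sum(d_s)
    t01 = sum(d * x for d, x in zip(d_s, x_s))
    t11 = sum(d * x * x for d, x in zip(d_s, x_s))
    # Gaps between the known totals and their design-weighted estimates
    gap0 = pop_size - t00
    gap1 = x_total - t01
    det = t00 * t11 - t01 * t01
    if det == 0:
        raise ValueError("singular system: x is constant in the sample")
    # Solve the 2 x 2 linear system by Cramer's rule
    lam0 = (t11 * gap0 - t01 * gap1) / det
    lam1 = (t00 * gap1 - t01 * gap0) / det
    return [d * (1.0 + lam0 + lam1 * x) for d, x in zip(d_s, x_s)]


# Hypothetical sample of 3 units; assumed known totals N = 6 and sum of x = 21
x_s = [1.0, 3.0, 6.0]
d_s = [2.0, 2.0, 2.0]
w_s = linear_calibration_weights(x_s, d_s, pop_size=6, x_total=21.0)

size_check = sum(w_s)                               # agrees with N
total_check = sum(w * x for w, x in zip(w_s, x_s))  # agrees with the x-total
```

With the design weights alone, the estimated x-total would be 20 rather than 21; the calibrated weights remove that disagreement while staying close to the design weights.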


(4) Calibration for convenience and transparency. As 
Harms and Duchesne (2006) point out, “The calibration 
approach has gained popularity in real applications because 
the resulting estimates are easy to interpret and to motivate, 
relying, as they do, on design weights and natural 
calibration constraints.” Calibration on known totals strikes 
the typical user as transparent and natural. Users who 
understand sample weighting appreciate that calibration 
leaves the design weights “slightly modified only”, while 
respecting the controls. The unbiasedness is only negligibly
disturbed. The simpler forms of calibration invoke no
assumptions, only “natural constraints”. Yet another
advantage is appreciated by users: In many applications,
calibration gives a unique weighting system, applicable to
all study variables, of which there are usually many in large
government surveys.


(5) Calibration in combination with other terms. Some 
authors use the word “calibration” in combination with 
other terms, to describe various directions of thought. 
Examples of this proliferation of terms are: Model-calibration
(Wu and Sitter 2001); g-calibration
(Vanderhoeft, Waeytens and Museux 2000); Harmonized
calibration (Webber, Latouche and Rancourt 2000); Higher
level calibration (Singh, Horn and Yu 1998); Regression
calibration (Demnati and Rao 2004); Non-linear calibration
(Plikusas 2006); Super generalized calibration (Calage super
généralisé; Ardilly 2006); Neural network model-calibration
estimator and Local polynomial model-calibration estimator
(Montanari and Ranalli 2003, 2005); Model-calibrated
pseudo empirical maximum likelihood estimator (Wu
2003), and yet others. Also, calibration plays a significant
role in the indirect sampling methods proposed in Lavallée
(2006). In a somewhat different spirit, not reviewed here,
are concepts such as calibrated imputation (Beaumont
2005a), and bias calibration (Chambers, Dorfman and
Wehrly 1993; Zheng and Little 2003). The following
review pages do not do justice to all the innovations within
the sphere of calibration, but the names alone do suggest
directions that have been explored.


(6) Calibration as a new direction for thought. If 
calibration represents “a new approach” with clear
differences compared with predecessors, we must examine 
such questions as: Does calibration generalize earlier 
theories or approaches? Does calibration give better, more 
satisfactory answers on questions of importance, as 
compared with earlier recognized approaches? Sections 4.5 
and 7.1 in this paper illustrate how the answers provided by 
calibration compare with, or contrast with, those obtained in 
earlier modes of reasoning. 

The practice of survey sampling encounters “nuisances” 
such as nonresponse, frame deficiencies and measurement 
errors. It is true that imputation and reweighting for nonresponse
are widely practiced, through a host of techniques.
But they are somehow “separate issues”, still waiting to be 
more fully embedded into a comprehensive, more 
satisfactory theory of inference in sample surveys. Many 
theory papers deal with estimation for an imagined ideal 
survey, nonexistent in practice, where nonresponse and 
other non-sampling errors are absent. This is not a criticism 
of the many excellent but idealized theory papers. The 
foundations need to be explored, too. 

Sections 9 and 10 indicate that calibration can provide a 
more systematic outlook on inference in surveys even in the 
presence of the various non-sampling errors. Future fruitful 
developments are expected in that regard.


2. Basic conditions for design-based estimation
in sample surveys


This section sets the background for Sections 3 to 7. By 
“basic conditions” I will mean single phase probability 
sampling of elements and full response. In practice, survey 
conditions are not that simple and perfect, but many theory 
papers nevertheless address this situation. 

A probability sample $s$ is drawn from the finite
population $U = \{1, 2, \ldots, k, \ldots, N\}$. The probability sampling
design generates for element $k$ a known inclusion
probability, $\pi_k > 0$, and a corresponding sampling design
weight $d_k = 1/\pi_k$. The value $y_k$ of the study variable $y$ is
recorded for all $k \in s$ (complete response). The objective is
to estimate a population total $Y = \sum_U y_k$ with the use of
auxiliary information. The study variable $y$ may be
continuous or, as in many government surveys, categorical.
For example, if $y$ is dichotomous with value $y_k = 0$ or
$y_k = 1$ according as person $k$ is employed or unemployed,
then the parameter $Y = \sum_U y_k$ to be estimated is the
population count of unemployed people. (If $A \subseteq U$ is a set
of elements, I write $\sum_A$ for $\sum_{k \in A}$.) The basic design
unbiased estimator of $Y$ is $\hat{Y}_{HT} = \sum_s d_k y_k$, the
Horvitz-Thompson estimator. It is, however, inefficient when
powerful auxiliary information is available for use at the
estimation phase.
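
As a minimal sketch of the Horvitz-Thompson estimator just defined, the following Python fragment computes the design-weighted sum of the sample values from the inclusion probabilities. The function name and the data (a hypothetical simple random sample of n = 4 from N = 20, so that every inclusion probability is 0.2) are assumptions made for illustration only.

```python
def horvitz_thompson_total(y_sample, pi_sample):
    """Estimate the population total Y by summing d_k * y_k over the
    sample, where d_k = 1 / pi_k is the sampling design weight."""
    if len(y_sample) != len(pi_sample):
        raise ValueError("y_sample and pi_sample must have the same length")
    total = 0.0
    for y_k, pi_k in zip(y_sample, pi_sample):
        if not 0.0 < pi_k <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        total += y_k / pi_k
    return total


# Hypothetical dichotomous study variable: 1 = unemployed, 0 = employed
y_s = [1, 0, 0, 1]
pi_s = [0.2, 0.2, 0.2, 0.2]

design_weights = [1.0 / pi_k for pi_k in pi_s]   # each equals 5
estimate = horvitz_thompson_total(y_s, pi_s)     # estimated count of unemployed
```

Each sampled person carries a weight of 5, so the two unemployed respondents yield an estimated population count of 10.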

The general notation for the auxiliary vector will be $\mathbf{x}_k$.
In some countries, for some surveys, the sources of auxiliary
data permit extensive vectors $\mathbf{x}_k$ to be built. But some
examples of simple vectors are: (1) $\mathbf{x}_k = (1, x_k)'$, where $x_k$
is the value for element $k$ of a continuous auxiliary variable
$x$; (2) the classification vector used to code membership in
one of $P$ mutually exclusive and exhaustive groups,
$\mathbf{x}_k = \boldsymbol{\gamma}_k = (\gamma_{1k}, \ldots, \gamma_{pk}, \ldots, \gamma_{Pk})'$, so that, for $p = 1, 2, \ldots, P$,
$\gamma_{pk} = 1$ if $k$ belongs to group $p$, and $\gamma_{pk} = 0$ if not; (3) the
combination of (1) and (2), $\mathbf{x}_k = (\boldsymbol{\gamma}_k', x_k \boldsymbol{\gamma}_k')'$; (4) the vector
$\mathbf{x}_k$ that codifies two classifications stringed out ‘side-by-side’,
the dimension of $\mathbf{x}_k$ being $P + Q - 1$, where $P$ and $Q$
are the respective number of categories, and the ‘minus-one’
is to avoid a singular matrix in the computation of weights
calibrated “to the margins”; (5) the extension of (4) to more
than two ‘side-by-side’ categorical classifications. Cases 4
and 5 are particularly important for production in national
statistical agencies.
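
The vector examples (1) to (4) can be made concrete with a short Python sketch. The helper names are hypothetical, groups are labelled from 0 (an assumption of this sketch; the text labels them from 1), and dropping the last category of the second classification is only one of several equivalent ways to obtain dimension P + Q - 1.

```python
def one_hot(group, n_groups):
    """Classification vector with a 1 in position `group` (0-based)."""
    return [1.0 if p == group else 0.0 for p in range(n_groups)]


def aux_continuous(x):
    """Case (1): intercept plus one continuous auxiliary variable."""
    return [1.0, x]


def aux_groups(group, n_groups):
    """Case (2): membership in one of P exclusive, exhaustive groups."""
    return one_hot(group, n_groups)


def aux_groups_by_continuous(x, group, n_groups):
    """Case (3): group indicators followed by x times the indicators."""
    gamma = one_hot(group, n_groups)
    return gamma + [x * g for g in gamma]


def aux_two_classifications(group_a, n_a, group_b, n_b):
    """Case (4): two classifications side by side. The indicators of each
    classification sum to 1, so one category is dropped to avoid a
    singular matrix; the dimension is n_a + n_b - 1."""
    return one_hot(group_a, n_a) + one_hot(group_b, n_b)[:-1]


vec1 = aux_continuous(2.5)
vec2 = aux_groups(1, 3)
vec3 = aux_groups_by_continuous(2.5, 1, 3)
vec4 = aux_two_classifications(1, 3, 0, 2)   # P = 3, Q = 2, dimension 4
```

Case (5) would extend `aux_two_classifications` by appending further classifications, each with one category dropped.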

In calibration reasoning it is crucially important to
specify exactly the auxiliary information. Under the basic
conditions we need to distinguish two different cases
relative to $\mathbf{x}_k$:


(i) $\mathbf{x}_k$ is a known vector value for every $k \in U$
(complete auxiliary information)


(ii) $\sum_U \mathbf{x}_k$ is a known (imported) total, and $\mathbf{x}_k$ is known
(observed) for every $k \in s$

It is often the survey environment that dictates whether
(i) or (ii) prevails. Case (i), complete auxiliary information,
occurs when $\mathbf{x}_k$ is specified in the sampling frame for every
$k \in U$ (and thus known for every $k \in s$). This environment
is typical of surveys on individuals and households in
Scandinavia and other North European countries equipped
with high quality administrative registers that can be
matched with the frame to provide a large number of
potential auxiliary variables. The population total $\sum_U \mathbf{x}_k$ is
obtained simply by adding the $\mathbf{x}_k$.

Case (i) gives considerable freedom in structuring the
auxiliary vector $\mathbf{x}_k$. For example, if $x_k$ is a continuous
variable value specified for every $k \in U$, then we are
invited to consider $x_k^2$ and other functions of $x_k$ for
inclusion in $\mathbf{x}_k$, because totals such as $\sum_U x_k^2$ and
$\sum_U \log x_k$ are readily computed. If the relationship to the
study variable $y$ is curved, it may be a serious omission not
to take into account known totals such as the quadratic one
or the logarithmic one.
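
To illustrate why case (i) is the more flexible one, the sketch below computes the totals of x, of its square and of its logarithm directly from a frame on which x is recorded for every population element. The function name and the three-element frame are hypothetical; under case (ii) only an imported total of x would be available and the other two totals could not be formed in this way.

```python
import math


def frame_totals(x_frame):
    """Population totals of x, x squared and log x, computed from a frame
    that records a positive value of x for every element of U."""
    return {
        "x": sum(x_frame),
        "x_squared": sum(x * x for x in x_frame),
        "log_x": sum(math.log(x) for x in x_frame),
    }


# Hypothetical frame values of the continuous auxiliary variable
x_frame = [1.0, 2.0, 4.0]
totals = frame_totals(x_frame)
```

Any of these totals can then serve as a calibration constraint, provided the corresponding function of x is also included in the auxiliary vector for the sampled elements.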

Case (ii) prevails in surveys where (i) is not met, but
where $\sum_U \mathbf{x}_k$ is imported from an outside source
considered accurate enough, and the individual value $\mathbf{x}_k$
is available (observed in data collection) for every $k \in s$.
Then $\sum_U \mathbf{x}_k$ is sometimes called an “independent control
total”, to mark its origin from outside the survey itself.
Case (ii) is less flexible: If $x_k$ is a variable with a total
$\sum_U x_k$ imported from a reliable source, then $\sum_U x_k^2$ may
be unavailable, barring $x_k^2$ from inclusion into $\mathbf{x}_k$.


3. Generalized regression estimation under 
the basic conditions 


3.1 The GREG concept 


Before examining calibration, let us consider generalized 
regression (GREG) estimation (or just regression 
estimation), for two good reasons: (1) GREG estimation can 
also be claimed to be a systematic way to take auxiliary 
information into account; (2) some (but not all) GREG 
estimators are calibration estimators, in that they can be 
expressed in terms of a calibrated linear weighting. 

GREG estimators and calibration estimators have been
extensively studied in the last two decades. The terms alone,
“GREG estimation” and “calibration estimation”, reflect a
clear difference in thinking. Statisticians who work in the
area are of two types: Those dedicated to “GREG thinking”
and those dedicated to “calibration thinking”. The
distinction may not be completely clear-cut, but it helps
to structure this review paper, so I will use it. I am not
venturing to say that the latter thinking is more prevalent in
national statistical agencies and the former more prevalent
in academic circles, but perhaps there is such a tendency.

The GREG estimator concept has evolved gradually since
the mid-1970s. The simple (linear) GREG is explained in
Särndal, Swensson and Wretman (1992); a thorough review
of regression estimation is given in Fuller (2002). The
central idea is that predicted y-values $\hat{y}_k$ can be produced
for all $N$ population elements, via the fit of an assisting
model and the use of the auxiliary vector values $\mathbf{x}_k$ known
for all $k \in U$. The predicted values serve to build a nearly
design unbiased estimator of the population total $Y = \sum_U y_k$
as


$$\hat{Y}_{GREG} = \sum_U \hat{y}_k + \sum_s d_k (y_k - \hat{y}_k) = \sum_s d_k y_k + \Big( \sum_U \hat{y}_k - \sum_s d_k \hat{y}_k \Big). \qquad (3.1)$$


The obvious motivation behind this construction is the
prospect of a highly accurate estimate $\hat{Y}_{GREG}$ through a
close fitting assisting model that leaves small residuals
$y_k - \hat{y}_k$. That modeling is the cornerstone of GREG
thinking. Some authors use the (also justifiable) name
general difference estimator for the construction (3.1).

The great variety of possible assisting models generates a 
wide family of GREG estimators of the form (3.1). The 
assisting model, an imagined relationship between x and y, 
can have many forms: linear, non-linear, generalized linear, 
mixed (model with some fixed, some random effects), and 
so on. Whatever the choice, the model is “assisting only’; 
even though it may be short of “true”, (3.1) is nearly deign 
unbiased under mild conditions on the assisting model and 
on the sampling design, so that (aes AN i=Op (n'") 
and (Yoreg —Y)/N =(Yoree1in ~Y)/N +O,(n™), where 
the statistic YopeG iin» the result of linearizing Yopp,, 18 
unbiased for Y. 


(3.1) 
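
The construction (3.1) can be illustrated with a minimal Python sketch. It assumes that the predictions are already available for every population element; the function name `greg_total` and the toy numbers are hypothetical and serve only to show the "population sum of predictions plus design-weighted sum of residuals" structure.

```python
import numpy as np

def greg_total(y_hat_pop, y_s, y_hat_s, d_s):
    """GREG estimate of the total Y in the form (3.1).

    y_hat_pop : predictions for all N population elements
    y_s       : observed y-values for the sampled elements
    y_hat_s   : predictions for the sampled elements
    d_s       : design weights d_k = 1/pi_k for the sampled elements
    """
    y_hat_pop = np.asarray(y_hat_pop, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    y_hat_s = np.asarray(y_hat_s, dtype=float)
    d_s = np.asarray(d_s, dtype=float)
    # Population sum of predictions plus design-weighted sum of sample residuals.
    return y_hat_pop.sum() + np.sum(d_s * (y_s - y_hat_s))

# Toy population (hypothetical numbers): N = 6, sample of n = 3, d_k = 2.
y_hat_pop = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sample = np.array([0, 2, 5])
y_s = np.array([1.5, 2.5, 6.5])
d_s = np.full(3, 2.0)

est = greg_total(y_hat_pop, y_s, y_hat_pop[sample], d_s)  # 21 + 1 = 22.0
ht = np.sum(d_s * y_s)                                    # Horvitz-Thompson: 21.0
```

With all predictions set to zero the function returns the Horvitz-Thompson estimate, which makes the "difference estimator" reading of (3.1) visible.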


3.2. Linear GREG 


By linear GREG I mean one that is generated by a linear
fixed effects assisting model. The predictions are
$\hat{y}_k = \mathbf{x}_k'\mathbf{B}_{s;dq}$, with

$$\mathbf{B}_{s;dq} = \Big(\sum_s d_k q_k \mathbf{x}_k \mathbf{x}_k'\Big)^{-1} \Big(\sum_s d_k q_k \mathbf{x}_k y_k\Big),$$

so (3.1) becomes

$$\hat{Y}_{GREG} = \Big(\sum_U \mathbf{x}_k\Big)' \mathbf{B}_{s;dq} + \sum_s d_k (y_k - \mathbf{x}_k'\mathbf{B}_{s;dq}). \tag{3.2}$$

The $q_k$ are scale factors, chosen by the statistician. The
standard choice is $q_k = 1$ for all k. The choice of the $q_k$ has
some (but often limited) impact on the accuracy of $\hat{Y}_{GREG}$;
near-unbiasedness holds for any specification (barring
outrageous choices) for the $q_k$. Although the model is
simple, the linear GREG (3.2) contains many estimators,
considering the many possible choices of the auxiliary
vector $\mathbf{x}_k$ and the scale factors $q_k$. Under general
conditions,

$$(\hat{Y}_{GREG} - Y)/N = \Big(\sum_s d_k E_k - \sum_U E_k\Big)/N + O_p(n^{-1}),$$

where $\sum_s d_k E_k$ is the Horvitz-Thompson estimator in the
residuals $E_k = y_k - \mathbf{x}_k'\mathbf{B}_{U;q}$, with
$\mathbf{B}_{U;q} = (\sum_U q_k \mathbf{x}_k \mathbf{x}_k')^{-1} (\sum_U q_k \mathbf{x}_k y_k)$. Hence the design-based properties
$E(\hat{Y}_{GREG}) \approx Y$ and $\mathrm{Var}(\hat{Y}_{GREG}) \approx \mathrm{Var}(\sum_s d_k E_k)$. A close
fitting linear regression of y on x holds the key to a small
variance for $\hat{Y}_{GREG}$ (and this is very different from claiming
that “a linear regression is the true regression”).

The linear GREG in Särndal, Swensson and Wretman
(1992) was motivated via the linear assisting model $\xi$
stating that $E_\xi(y_k) = \boldsymbol{\beta}'\mathbf{x}_k$ and $V_\xi(y_k) = \sigma_k^2$. Generalized
least squares fit gives the estimator (3.2) with $q_k = 1/\sigma_k^2$. In
that context, an educated guess about the variation of the
residuals $y_k - \boldsymbol{\beta}'\mathbf{x}_k$ determines the $q_k$. When the vector $\mathbf{x}_k$
is fixed, the modeling effort boils down to an opinion about
the residual pattern. The choice $\sigma_k^2 = \sigma^2 x_k$ gives the
classical ratio estimator. If $1/q_k = \boldsymbol{\mu}'\mathbf{x}_k$ for all $k \in U$ and a
constant vector $\boldsymbol{\mu}$, then (3.2) reduces to “the cosmetic
form” $(\sum_U \mathbf{x}_k)'\mathbf{B}_{s;dq}$.

As Beaumont and Alavi (2004) and others have pointed
out, the linear GREG estimator is bias-robust (nearly
unbiased although the assisting model falls short of
“correct”), but it can be considerably less efficient (have
larger mean squared error) than model dependent alternatives
which, although biased, may have a considerably
smaller variance. Thus one may claim that linear GREG is
not variance robust; nevertheless, it is a basic concept in
design-based survey theory.

The specification of $\mathbf{x}_k$ should include variables (with
known population totals) that served already in defining the
sampling design. Design stage information should not be
relinquished at the estimation stage; instead, a “repeated
usage” is recommended. For example, in stratified simple
random sampling (STSI), the vector $\mathbf{x}_k$ in estimator (3.2)
should include, along with other available variables, the
dummy coded stratum identifier,
$\boldsymbol{\gamma}_k = (\gamma_{1k}, \ldots, \gamma_{hk}, \ldots, \gamma_{Hk})'$, where $\gamma_{hk} = 1$ if element k belongs to stratum
h, and $\gamma_{hk} = 0$ if not, $h = 1, \ldots, H$.

We can write the linear GREG (3.2) as a weighted
sample sum, $\hat{Y}_{GREG} = \sum_s w_k y_k$, with

$$w_k = d_k g_k; \quad g_k = 1 + q_k \boldsymbol{\lambda}_s'\mathbf{x}_k; \quad \boldsymbol{\lambda}_s' = \Big(\sum_U \mathbf{x}_k - \sum_s d_k \mathbf{x}_k\Big)' \Big(\sum_s d_k q_k \mathbf{x}_k \mathbf{x}_k'\Big)^{-1}. \tag{3.3}$$

The weights $w_k$ happen to be calibrated to (consistent
with) the known population x-total: $\sum_s w_k \mathbf{x}_k = \sum_U \mathbf{x}_k$. That
$\hat{Y}_{GREG}$ is expressible as a linearly weighted sum with
calibrated weights is a fortuitous by-product. It is not part of
GREG thinking, whose central idea formulated in (3.1) is
the fit of an assisting model. A few other GREG’s than the
simple linear one also have the calibration property, as will
be noted later.
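
The equivalence between the weighted form (3.3) and the prediction form (3.2), and the calibration property of the weights, can be checked numerically. The following is a minimal sketch under stated assumptions: a simulated population, simple random sampling with $d_k = N/n$, an auxiliary vector (1, x_k)', and a helper `greg_weights` that is hypothetical, not taken from any of the software cited in this paper.

```python
import numpy as np

def greg_weights(X_s, d_s, x_total, q_s=None):
    """Weights w_k = d_k * g_k of (3.3).

    X_s     : (n, p) matrix whose rows are the sample auxiliary vectors x_k
    d_s     : (n,) design weights
    x_total : (p,) known population total of x_k
    q_s     : (n,) scale factors q_k (defaults to 1 for all k)
    """
    X_s = np.asarray(X_s, dtype=float)
    d_s = np.asarray(d_s, dtype=float)
    q_s = np.ones_like(d_s) if q_s is None else np.asarray(q_s, dtype=float)
    T = (X_s * (d_s * q_s)[:, None]).T @ X_s          # sum_s d_k q_k x_k x_k'
    lam = np.linalg.solve(T, x_total - X_s.T @ d_s)   # lambda_s (T is symmetric)
    return d_s * (1.0 + q_s * (X_s @ lam))

# Simulated population (assumption for illustration only).
rng = np.random.default_rng(0)
N, n = 200, 40
x = rng.uniform(1.0, 10.0, N)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, N)
X = np.column_stack([np.ones(N), x])
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)

w = greg_weights(X[s], d, X.sum(axis=0))
y_greg_weighted = np.sum(w * y[s])

# Same estimator via (3.2): fitted coefficient, predictions, residual term.
XtD = (X[s] * d[:, None]).T
B = np.linalg.solve(XtD @ X[s], XtD @ y[s])
y_greg_32 = X.sum(axis=0) @ B + np.sum(d * (y[s] - X[s] @ B))
```

Note that `greg_weights` never touches the y-values, which is the multi-purpose property of the weight system.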


3.3. Non-linear GREG 


Two features of the linear GREG (3.2) make it a
favourite choice for routine production in statistical
agencies: (i) the auxiliary population total $\sum_U \mathbf{x}_k$ becomes
factored out, so the estimation can proceed as long as an
accurate value for that total can be computed or imported,
and (ii) when written as the linearly weighted sum
$\hat{Y}_{GREG} = \sum_s w_k y_k$, the weight system (3.3) is independent of
the y-variable and can thereby be applied to all y-variables
in the survey. We need not know $\mathbf{x}_k$ individually for all
$k \in U$; knowing $\sum_U \mathbf{x}_k$ suffices. Needless to say, if we do
know all $\mathbf{x}_k$, more efficient (still nearly design unbiased)
members of the GREG family (3.1) can be sought. This will
also counter another criticism of the linear GREG, namely
that a linear model is unrealistic for some types of data. For
example, for a dichotomous y-variable, a logistic assisting
model may be both more realistic and yield a more precise
GREG estimator.

By a non-linear GREG estimator I mean one generated 
as in (3.1) by an assisting model of other type than “linear in 
x, with fixed effects”. Among the first to extend the GREG 
concept in this direction are Firth and Bennett (1998) and 
Lehtonen and Veijanen (1998); see also Chambers et al. 
(1993). In the last few years, several authors have studied 
model-assisted non-linear GREG’s. 

Non-linear GREG is a versatile idea; a variety of
estimators become possible via assisting models $\xi$ of the
following type:

$$E_\xi(y_k \mid \mathbf{x}_k) = \mu_k \quad \text{for } k \in U \tag{3.4}$$

where the model mean $\mu_k$ and the model variance
$V_\xi(y_k \mid \mathbf{x}_k)$ are given appropriate formulations.

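One application of (3.4) is when $\mu_k = \mu(\mathbf{x}_k, \boldsymbol{\theta})$ is a
specified non-linear function in $\mathbf{x}_k$. Having estimated $\boldsymbol{\theta}$ by
$\hat{\boldsymbol{\theta}}$, the fitted values needed for $\hat{Y}_{GREG}$ in (3.1) are
$\hat{y}_k = \mu(\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ for $k \in U$. For example, if the modeler
specifies $\log \mu_k = \alpha + \beta x_k$, the predictions for use in (3.1)
are, following parameter estimation, $\hat{y}_k = \exp(\hat{\alpha} + \hat{\beta} x_k)$.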

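Other applications of (3.4) include generalized linear
models such that $g(\mu_k) = \mathbf{x}_k'\boldsymbol{\theta}$, for a specified link function
$g(\cdot)$, and $V_\xi(y_k \mid \mathbf{x}_k) = v(\mu_k)$ is given an appropriate
structure. We estimate $\boldsymbol{\theta}$ by $\hat{\boldsymbol{\theta}}$; the fitted values needed for
the non-linear GREG estimator (3.1) are
$\hat{y}_k = \hat{\mu}_k = g^{-1}(\mathbf{x}_k'\hat{\boldsymbol{\theta}})$. For example, using a logistic assisting
model, $\mathbf{x}_k'\boldsymbol{\theta} = \mathrm{logit}(\mu_k) = \log(\mu_k/(1 - \mu_k))$, and
$\hat{y}_k = \hat{\mu}_k = \exp(\mathbf{x}_k'\hat{\boldsymbol{\theta}})/(1 + \exp(\mathbf{x}_k'\hat{\boldsymbol{\theta}}))$.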
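
As a rough illustration of the logistic case, the sketch below fits a logistic assisting model to a dichotomous y-variable and inserts the fitted probabilities into (3.1). Several things are assumptions made for the example: the simulated population, simple random sampling, and the choice to estimate the parameter by a design-weighted Newton-Raphson fit (`fit_weighted_logistic` is a hypothetical helper with a fixed number of iterations, not a production routine).

```python
import numpy as np

def fit_weighted_logistic(X, y, d, n_iter=25):
    """Design-weighted logistic fit by Newton-Raphson (illustrative only)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ theta)))
        score = X.T @ (d * (y - mu))                       # weighted score vector
        info = (X * (d * mu * (1.0 - mu))[:, None]).T @ X  # weighted information matrix
        theta = theta + np.linalg.solve(info, score)
    return theta

# Simulated population with a dichotomous study variable (assumption).
rng = np.random.default_rng(1)
N, n = 1000, 150
x = rng.uniform(0.0, 4.0, N)
X = np.column_stack([np.ones(N), x])
p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * x)))
y = (rng.uniform(size=N) < p).astype(float)
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)

theta_hat = fit_weighted_logistic(X[s], y[s], d)
mu_hat = 1.0 / (1.0 + np.exp(-(X @ theta_hat)))          # predictions for all k in U
y_greg = mu_hat.sum() + np.sum(d * (y[s] - mu_hat[s]))   # estimator (3.1)
```

Because the model contains an intercept and is fitted with the design weights, the weighted residual sum is zero at convergence, so in this particular setup the estimate reduces to the population sum of the fitted probabilities. Unlike the linear GREG, the fit has to be repeated for every y-variable.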

Lehtonen and Veijanen (1998) examine the case of a
categorical study variable with I classes, $i = 1, 2, \ldots, I$;
$y_{ik} = 1$ if element k belongs to category i, and $y_{ik} = 0$ if
not. For example, in a Labour Force Survey with I = 3
categories, “employed”, “not employed” and “not in the
labour force”, an objective is to estimate the respective
population counts $Y_i = \sum_U y_{ik}$, i = 1, 2, 3. These authors use
the logistic assisting model

$$E_\xi(y_{ik} \mid \mathbf{x}_k) = \mu_{ik}; \quad \mu_{ik} = \exp(\mathbf{x}_k'\boldsymbol{\theta}_i) \Big/ \Big[1 + \sum_{r=2}^{I} \exp(\mathbf{x}_k'\boldsymbol{\theta}_r)\Big]. \tag{3.5}$$

Estimates $\hat{\boldsymbol{\theta}}_i$ of the $\boldsymbol{\theta}_i$ are obtained by maximizing
design-weighted log-likelihood. The resulting predictions
$\hat{y}_{ik} = \hat{\mu}_{ik}$ are used to form
$\hat{Y}_{iGREG} = \sum_U \hat{\mu}_{ik} + \sum_s d_k (y_{ik} - \hat{\mu}_{ik})$ for $i = 1, \ldots, I$.

Another development is the application of GREG
reasoning to estimation for domains, as in Lehtonen,
Särndal and Veijanen (2003, 2005) and Myrskylä (2007).
Mixed models are used in the first two of these papers to
assist the non-linear GREG. Let $U_a$ be a domain, $U_a \subset U$,
whose total $Y_{ia} = \sum_{U_a} y_{ik}$ we wish to estimate, $i = 1, 2, \ldots, I$.
The 2005 paper derives the predictions for the non-linear
GREG from the logistic mixed model stating that for
$k \in U_a$,

$$E_\xi(y_{ik} \mid \mathbf{x}_k; \mathbf{u}_{ia}) = \exp(\mathbf{x}_k'\boldsymbol{\theta}_{ia}) \Big/ \Big[1 + \sum_{r=2}^{I} \exp(\mathbf{x}_k'\boldsymbol{\theta}_{ra})\Big] \tag{3.6}$$

with $\boldsymbol{\theta}_{ia} = \boldsymbol{\beta}_i + \mathbf{u}_{ia}$, where $\mathbf{u}_{ia}$ is a vector of domain
specific random deviations from the fixed effects vector $\boldsymbol{\beta}_i$.

Non-linear GREG’s assisted by models such as (3.5) and
(3.6) require model fitting for every y-variable separately;
there is no uniformly applicable weight system. However,
the question arises: Are there examples of non-linear
GREG’s such that the practical advantages of linear GREG
are preserved, that is, a linearly weighted form with
calibrated weights independent of the y-variable? The answer is
in the affirmative. Two directions in recent literature are of
interest in this regard:

Breidt and Opsomer (2000) and Montanari and Ranalli
(2005) consider model-assisted local polynomial GREG
estimators, for the case of a single continuous auxiliary
variable with values $x_k$ known for all $k \in U$. Several
choices have to be made in the process: (1) the order q of
the local polynomial expression, (2) the specification of the
kernel function, and (3) the value of the band width. The
resulting estimator can be expressed in terms of weights
calibrated with respect to population totals of the powers of
$x_k$, so that $\sum_s w_k x_k^j = \sum_U x_k^j$ for $j = 0, 1, \ldots, q$.

Breidt, Claeskens and Opsomer (2005) develop a
penalized spline GREG estimator for a single x-variable; the
assisting model is $m(x; \boldsymbol{\beta}) = \beta_0 + \beta_1 x + \ldots + \beta_q x^q + \sum_{j=1}^{K} \beta_{q+j} (x - \kappa_j)_+^q$,
where $(t)_+^q = t^q$ if $t > 0$ and 0
otherwise, q is the degree of the spline, and the $\kappa_j$ are
suitably spaced knots, for example, uniformly spaced
sample quantiles of the $x_k$-values. After estimation of the
$\beta$-parameters, they obtain the predictions $\hat{y}_k = m(x_k; \hat{\boldsymbol{\beta}})$
needed for the general GREG formula (3.1). The authors
point out that the resulting GREG estimator is calibrated for
the parametric portion of the model, that is,
$\sum_s w_k x_k^j = \sum_U x_k^j$ for $j = 0, 1, \ldots, q$, and also for the
truncated polynomial terms in the model as long as they are
left unpenalized.

We can summarize GREG estimation as follows. The
linear GREG has practical advantages for large scale
statistics production: It can be expressed as a linearly
weighted sum of $y_k$-values with weights calibrated to
$\sum_U \mathbf{x}_k$; the weights are independent of the $y_k$-values and
may be applied to all y-variables in the survey. It is
sufficient to know a population auxiliary total $\sum_U \mathbf{x}_k$,
imported from a reliable source. Non-linear GREG may
give a considerably reduced variance, as a result of the more
refined models that can be considered when there is
complete auxiliary information (known $\mathbf{x}_k$ for all $k \in U$);
near design unbiasedness is preserved. Certain non-linear
GREG’s can be written as linearly weighted sums.

In academic exercises with artificially created 
populations and relationships, one can provoke situations 
where a nonlinear GREG has a large variance advantage 
over a linear GREG. Such experiments are important for 
illustration. However, to meet the daily production needs in
national statistical agencies, “farfetched” nonlinear GREG’s
seem to be of fairly remote interest at this point in time; the
assisting models for GREG must meet requirements of
robustness and practicality. The attraction of a minor
reduction of the sampling variance is swept away by worries
about other (non-sampling) errors and troubles in the daily
production process.

The progression from linear to non-linear GREG creates 
opportunities and generates questions. What is the most 
appropriate formulation of the model expectation $\mu_k$? How
sensitive are the results to the specification of the variance 
part of the assisting model? To what extent is computational 
efficiency an issue? Further research will respond more 
fully to these questions. 


4. The calibration approach to estimation 


4.1 Calibration under basic conditions 


A crucial step in the GREG approach reviewed in the
previous section is to produce predicted values $\hat{y}_k$ through
the fit of an assisting model. By contrast, the calibration
approach, as defined in Section 1.1, does not refer explicitly
to any model. It emphasizes instead the information on
which one can calibrate. A key element of “calibration
thinking” is the linear weighting of the observed y-values,
with weights made to confirm computable aggregates. This
conceptual difference will sometimes lead to different
estimators in the two approaches.

The calibration approach has considerable generality; it
can deal with a variety of conditions: complex sampling
designs, adjustments for nonresponse and frame errors. This
section, however, focuses on the basic conditions in Section
2: single phase sampling and full response. The notation
remains as in Section 2. The material available for
estimating the population total $Y = \sum_U y_k$ is: (i) the study
variable values $y_k$ observed for $k \in s$, (ii) the known
design weights $d_k = 1/\pi_k$ for $k \in U$, and (iii) the known
vector values $\mathbf{x}_k$ for $k \in U$ (or an imported total $\sum_U \mathbf{x}_k$).
These simple conditions prevail in Deville and Särndal
(1992) and Deville, Särndal and Sautory (1993), papers
which gave the approach a name and inspired further work.
Even though the background is simple, calibration raises
several issues, some of them computational, as reviewed in
Section 5.

The objective in Sections 4.2 and 4.3 is to determine
weights $w_k$ to satisfy the calibration equation
$\sum_s w_k \mathbf{x}_k = \sum_U \mathbf{x}_k$, then use them to form the calibration
estimator of Y as $\hat{Y}_{CAL} = \sum_s w_k y_k$, which we can confront
with the unbiased Horvitz-Thompson estimator by writing
$\hat{Y}_{CAL} = \hat{Y}_{HT} + \sum_s (w_k - d_k) y_k$. It follows that the bias of
$\hat{Y}_{CAL}$ is $E(\hat{Y}_{CAL}) - Y = E(\sum_s (w_k - d_k) y_k)$. Meeting the
objective of near design unbiasedness requires
$E(\sum_s (w_k - d_k) y_k) \approx 0$, whatever the y-variable. Evidently,
the calibration should strive for small deviations $w_k - d_k$.

The objective “calibration for consistency with known 
population auxiliary totals” can be realized in many ways. 
We can construct many sets of weights calibrated to the
known $\sum_U \mathbf{x}_k$. This section examines this proliferation from
two perspectives noted in the literature: the minimum
distance method and the instrumental vector method. Yet
another construction of a variety of calibrated weights is
proposed in Demnati and Rao (2004). 


4.2 The minimum distance method 


In this method, the calibration sets out to modify the
initial weights $d_k = 1/\pi_k$ into new weights $w_k$, determined
to “be close to” the $d_k$. To this end, consider the distance
function $G_k(w, d)$, defined for every $w > 0$, such that
$G_k(w, d) \ge 0$, $G_k(d, d) = 0$, differentiable with respect to
w, strictly convex, with continuous derivative
$g_k(w, d) = \partial G_k(w, d)/\partial w$ such that $g_k(d, d) = 0$.
Usually the distance function is chosen such that
$g_k(w, d) = g(w/d)/q_k$, where the $q_k$ are suitably chosen
positive scale factors, $g(\cdot)$ is a function of a single
argument, continuous, strictly increasing, with
$g(1) = 0$, $g'(1) = 1$. Let $F(u) = g^{-1}(u)$ be the inverse
function of $g(\cdot)$. Minimizing the total distance
$\sum_s G_k(w_k, d_k)$ subject to the calibration equation
$\sum_s w_k \mathbf{x}_k = \sum_U \mathbf{x}_k$ leads to $w_k = d_k F(q_k \mathbf{x}_k'\boldsymbol{\lambda})$, where $\boldsymbol{\lambda}$ is
obtained as the solution (assuming one exists) of

$$\sum_s d_k \mathbf{x}_k F(q_k \mathbf{x}_k'\boldsymbol{\lambda}) = \sum_U \mathbf{x}_k. \tag{4.1}$$

The weights have an optimality property, because a duly
specified objective function is minimized, but it is a “weak
optimality” in the sense that there are many possible
specifications of the distance function and the scale factors
$q_k$.

Much attention has focused on the distance function
$G_k(w_k, d_k) = (w_k - d_k)^2/(2 d_k q_k)$. It gives
$g_k(w_k, d_k) = (w_k/d_k - 1)/q_k$; $g(w/d) = w/d - 1$; $F(u) = g^{-1}(u) = 1 + u$.
The term “the linear case” is thus appropriate. The task is
then to minimize the “chi-square distance”
$\sum_s (w_k - d_k)^2/(2 d_k q_k)$ subject to $\sum_s w_k \mathbf{x}_k = \sum_U \mathbf{x}_k$.
Equation (4.1) reads $\sum_s d_k \mathbf{x}_k (1 + q_k \mathbf{x}_k'\boldsymbol{\lambda}) = \sum_U \mathbf{x}_k$, which is
easily solved for $\boldsymbol{\lambda}$. The resulting estimator of $Y = \sum_U y_k$ is
$\hat{Y}_{CAL} = \sum_s w_k y_k$ with weights $w_k = d_k g_k$ given by (3.3).
That is, $\hat{Y}_{CAL} = \hat{Y}_{GREG}$ as given by (3.2), and the residuals
that determine the asymptotic variance are
$E_k = y_k - \mathbf{x}_k'\mathbf{B}_{U;q}$ as given in Section 3.2. Some negative
weights $w_k$ may occur.
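
The step "easily solved" can be written out as a short worked derivation. It assumes only that the matrix $\sum_s d_k q_k \mathbf{x}_k \mathbf{x}_k'$ is nonsingular.

```latex
\begin{align*}
\sum_s d_k \mathbf{x}_k \,(1 + q_k \mathbf{x}_k' \boldsymbol{\lambda}) &= \sum_U \mathbf{x}_k \\
\Big(\sum_s d_k q_k \mathbf{x}_k \mathbf{x}_k'\Big)\boldsymbol{\lambda}
  &= \sum_U \mathbf{x}_k - \sum_s d_k \mathbf{x}_k \\
\boldsymbol{\lambda}
  &= \Big(\sum_s d_k q_k \mathbf{x}_k \mathbf{x}_k'\Big)^{-1}
     \Big(\sum_U \mathbf{x}_k - \sum_s d_k \mathbf{x}_k\Big)
\end{align*}
```

Inserting this solution into $w_k = d_k (1 + q_k \mathbf{x}_k'\boldsymbol{\lambda})$ gives a weight of the form $d_k g_k$ with $g_k = 1 + q_k \boldsymbol{\lambda}'\mathbf{x}_k$, which is the weight system referred to as (3.3).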

The linear GREG estimator implies weights that happen
to be calibrated (to $\sum_U \mathbf{x}_k$), and the opposite side of the
same coin says that the linear case for calibration (with
chi-square distance) brings the linear GREG estimator. The
tendency in some articles and applications to intertwine 
GREG thinking and calibration thinking stems from this 
fact. Many successful applications of the use of auxiliary 
information stem, in any case, from this linearity on both 
sides of the coin. The Canadian Labour Force Survey is an 
example, and an interesting recent development for that 
survey is the use of composite estimators, with part of the 
information coming from the survey results in previous 
months, as described in Fuller and Rao (2001). 

The calibration equation is satisfied for any choice of the
positive scale factors $q_k$ in (4.1). A simple choice is $q_k = 1$
for all k. But it is not always the preferred choice. For
example, if there is a single, always positive auxiliary
variable, and $\mathbf{x}_k = x_k$, then many will intuitively expect
$\hat{Y}_{CAL} = \sum_s w_k y_k$ to deliver the usual ratio estimator
$\sum_U x_k (\sum_s d_k y_k)/(\sum_s d_k x_k)$, and it does, but by taking
$q_k = x_k^{-1}$.
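
The ratio-estimator case can be verified with a few lines of Python. The sketch assumes a simulated population with a single positive auxiliary variable and simple random sampling; `linear_cal_weights` is a hypothetical helper implementing the scalar version of the linear case.

```python
import numpy as np

def linear_cal_weights(xs, d, x_total, q):
    """Chi-square-distance calibration weights for a scalar auxiliary variable."""
    lam = (x_total - np.sum(d * xs)) / np.sum(d * q * xs ** 2)
    return d * (1.0 + q * xs * lam)

# Simulated population (assumption for illustration only).
rng = np.random.default_rng(2)
N, n = 500, 50
x = rng.uniform(1.0, 20.0, N)          # single, always positive auxiliary variable
y = 1.5 * x + rng.normal(0.0, 2.0, N)
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)
xs, ys = x[s], y[s]

w_ratio = linear_cal_weights(xs, d, x.sum(), q=1.0 / xs)    # q_k = 1/x_k
w_unit = linear_cal_weights(xs, d, x.sum(), q=np.ones(n))   # q_k = 1

ratio_estimator = x.sum() * np.sum(d * ys) / np.sum(d * xs)
y_cal_ratio = np.sum(w_ratio * ys)
```

Both weight systems reproduce the known x-total; only the choice q_k = 1/x_k makes every weight equal to d_k times the same factor, which is what yields the ratio estimator.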

Another distance function of considerable interest is
$G_k(w_k, d_k) = \{w_k \log(w_k/d_k) - w_k + d_k\}/q_k$. It leads to
$F(u) = g^{-1}(u) = \exp(u)$, “the exponential case”. Then (4.1)
reads $\sum_s d_k \mathbf{x}_k \exp(q_k \mathbf{x}_k'\boldsymbol{\lambda}) = \sum_U \mathbf{x}_k$. Numeric methods are
required to solve for $\boldsymbol{\lambda}$, to obtain the weights
$w_k = d_k \exp(q_k \mathbf{x}_k'\boldsymbol{\lambda})$. No negative weights $w_k$ will occur.
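
One possible numeric method for the exponential case is Newton's method applied to the calibration equation. The sketch below is an assumption about how this could be done, not a description of any particular software; `raking_weights`, its tolerance and its iteration limit are hypothetical choices.

```python
import numpy as np

def raking_weights(X_s, d_s, x_total, q_s=None, tol=1e-10, max_iter=50):
    """Solve (4.1) with F(u) = exp(u) by Newton's method.

    Returns the weights w_k = d_k * exp(q_k * x_k' lambda).
    """
    X_s = np.asarray(X_s, dtype=float)
    d_s = np.asarray(d_s, dtype=float)
    q_s = np.ones_like(d_s) if q_s is None else np.asarray(q_s, dtype=float)
    lam = np.zeros(X_s.shape[1])
    for _ in range(max_iter):
        w = d_s * np.exp(q_s * (X_s @ lam))
        gap = x_total - X_s.T @ w                  # residual of the calibration equation
        if np.max(np.abs(gap)) < tol * max(1.0, np.max(np.abs(x_total))):
            break
        jac = (X_s * (w * q_s)[:, None]).T @ X_s   # derivative of sum_s w_k x_k w.r.t. lambda
        lam = lam + np.linalg.solve(jac, gap)
    else:
        raise RuntimeError("no convergence; a solution may not exist")
    return w

# Simulated population (assumption for illustration only).
rng = np.random.default_rng(3)
N, n = 300, 60
x = rng.uniform(1.0, 10.0, N)
X = np.column_stack([np.ones(N), x])
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)

w = raking_weights(X[s], d, X.sum(axis=0))
```

The weights are positive by construction, since each is a design weight multiplied by an exponential.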

Deville and Särndal (1992) show that a variety of
distance functions satisfying mild conditions will generate
asymptotically equivalent calibration estimators. Alternative
distance functions are compared in Deville, Särndal and
Sautory (1993), Singh and Mohl (1996), Stukel, Hidiroglou
and Särndal (1996). Some distance functions will guarantee
weights falling within specified bounds, so as to rule out too
large or too small (negative) weights. Changes in the
distance function will often have minor effect only on the
variance of the calibration estimator $\hat{Y}_{CAL} = \sum_s w_k y_k$, even
if the sample size is rather small. Questions about the
existence of a solution to the calibration equation are
discussed in Théberge (2000).


4.3 The instrument vector method 


An alternative to distance minimization is the
instrumental vector method, considered in Deville (1998),
Estevao and Särndal (2000, 2006) and Kott (2006). It can
also generate many alternative sets of weights calibrated to
the same information.

We can consider weights of the form $w_k = d_k F(\boldsymbol{\lambda}'\mathbf{z}_k)$,
where $\mathbf{z}_k$ is a vector with values defined for $k \in s$ and
sharing the dimension of the specified auxiliary vector $\mathbf{x}_k$,
and the vector $\boldsymbol{\lambda}$ is determined from the calibration
equation $\sum_s w_k \mathbf{x}_k = \sum_U \mathbf{x}_k$. The function $F(\cdot)$ plays the
same role as in the distance minimization method; several
choices of $F(\cdot)$ are of interest, for example, $F(u) = 1 + u$
and $F(u) = \exp(u)$.

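Opting for the linear function $F(u) = 1 + u$, we have
$w_k = d_k (1 + \boldsymbol{\lambda}'\mathbf{z}_k)$. It is an easy exercise to determine $\boldsymbol{\lambda}$ to
satisfy the calibration equation $\sum_s w_k \mathbf{x}_k = \sum_U \mathbf{x}_k$. The
resulting calibration estimator is

$$\hat{Y}_{CAL} = \sum_s w_k y_k; \quad w_k = d_k (1 + \boldsymbol{\lambda}'\mathbf{z}_k), \quad \boldsymbol{\lambda}' = \Big(\sum_U \mathbf{x}_k - \sum_s d_k \mathbf{x}_k\Big)' \Big(\sum_s d_k \mathbf{z}_k \mathbf{x}_k'\Big)^{-1}.$$

Whatever the choice of $\mathbf{z}_k$, the weights
$w_k = d_k (1 + \boldsymbol{\lambda}'\mathbf{z}_k)$ satisfy the calibration equation. The standard
choice is $\mathbf{z}_k = \mathbf{x}_k$. In particular, setting $\mathbf{z}_k = q_k \mathbf{x}_k$, for
specified $q_k$, gives the weights (3.3).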
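
That the calibration equation holds for any instrument can be checked numerically. The sketch below assumes a simulated population and simple random sampling; `instrument_weights` is a hypothetical helper, and the alternative instrument (1, x_k^2)' is chosen arbitrarily for illustration.

```python
import numpy as np

def instrument_weights(X_s, Z_s, d_s, x_total):
    """Weights w_k = d_k * (1 + lambda' z_k) for the linear function F(u) = 1 + u."""
    X_s, Z_s, d_s = (np.asarray(a, dtype=float) for a in (X_s, Z_s, d_s))
    M = (Z_s * d_s[:, None]).T @ X_s                    # sum_s d_k z_k x_k'
    # lambda' M = (x_total - sum_s d_k x_k)'  is equivalent to  M' lambda = difference
    lam = np.linalg.solve(M.T, x_total - X_s.T @ d_s)
    return d_s * (1.0 + Z_s @ lam)

# Simulated population (assumption for illustration only).
rng = np.random.default_rng(6)
N, n = 300, 60
x = rng.uniform(1.0, 10.0, N)
X = np.column_stack([np.ones(N), x])
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)
x_total = X.sum(axis=0)

w_standard = instrument_weights(X[s], X[s], d, x_total)   # standard choice z_k = x_k
Z_alt = np.column_stack([np.ones(n), x[s] ** 2])          # a different instrument
w_alt = instrument_weights(X[s], Z_alt, d, x_total)
```

The two weight systems differ element by element, yet both reproduce the same population totals of the auxiliary vector.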

Even “deliberately awkward choices” for $\mathbf{z}_k$ give
surprisingly good results. For example, let $x_k$ be a single
continuous auxiliary variable, and $z_k = c_k x_k^{p-1}$. Suppose
p = 3, and $c_k = 1$ for 4 elements only, chosen at random
from n = 100 elements in a realized sample s, and $c_k = 0$
for the remaining 96. The near-unbiasedness of
$\hat{Y}_{CAL} = \sum_s d_k (1 + \boldsymbol{\lambda}'\mathbf{z}_k) y_k$ is still present. Even with such a
sparse z-vector, the increase in variance, relative to better
choices of $\mathbf{z}_k$, may not be excessive.

When both sampling design and x-vector are fixed,
Estevao and Särndal (2004) and Kott (2004) note that there
is an asymptotically optimal z-vector given by

$$\mathbf{z}_k = \mathbf{z}_{0k} = d_k^{-1} \sum_{\ell \in s} (d_k d_\ell - d_{k\ell}) \mathbf{x}_\ell \tag{4.2}$$




where $d_{k\ell}$ is the inverse of the second order inclusion
probability $\pi_{k\ell} = P(k \,\&\, \ell \in s)$, assumed strictly positive.
The resulting calibration estimator,
$\hat{Y}_{CAL} = \sum_s d_k (1 + \boldsymbol{\lambda}'\mathbf{z}_{0k}) y_k$, is essentially the “randomization-optimal
estimator” due originally to Montanari (1987) and
discussed by many since then.

Andersson and Thorburn (2005) view the question from 
the opposite direction and ask: In the minimum distance 
method, can a distance function be specified such that its 
minimization will deliver the randomization-optimal 
estimator? They do find this distance; not entirely 
surprisingly, it is related to (but not identical to) the chi- 
square distance. 


4.4 Does calibration need an explicitly stated model? 


The calibration approach as presented in Sections 4.2 and 
4.3 proceeds by simply computing the weights that 
reproduce the specified auxiliary totals. There is no explicit 
assisting model, unless one were to insist that picking 
certain variables for inclusion in the vector x, amounts to a 
serious modeling effort. Instead, the weights are justified 
primarily by their consistency with the stated controls. Early 
contributions reflect this attitude, from Deming (1943), and 
continuing with Alexander (1987), Zieschang (1990) and 
others. This begs the question: Is it nevertheless important to 
motivate such “model-free calibration” with an explicit 
model statement? It is true that statisticians are trained to 
think in terms of models, and they feel more or less 
compelled to always have a statistical procedure
accompanied by a model statement. It may indeed have 
some pedagogical merit, also in explaining calibration, to 
state the associated relationship of y to x, even if it is as 
simple as a standard linear model. 

But will a stated model help the users and practitioners 
better understand the calibration approach? To most of them 
the approach is perfectly clear and transparent anyway. 
They need no other justification than the consistency with 
stated controls. Will a search for “the true model with the 
true variance structure” bring significantly better accuracy 
for the bulk of the many estimates produced in a large 
government survey? It is unlikely. 

The next section deals with model-calibration. For that
variety, proposed by Wu and Sitter (2001), modeling has
indeed an explicit and prominent role. These authors call the
linear calibration estimator, $\hat{Y}_{CAL} = \sum_s w_k y_k$ with weights
$w_k$ given by (3.3), “a routine application without
modeling”. The description is appropriate in that all that is
necessary is to identify the x-variables with their known
population totals.


4.5 Model-calibration


The idea of model-calibration is proposed in Wu and
Sitter (2001) and pursued further in Wu (2003) and
Montanari and Ranalli (2003, 2005). The motivating factor
is that complete auxiliary information allows a more
effective use of the $\mathbf{x}_k$ known for every $k \in U$ than what is
possible in model-free calibration, where a known total
$\sum_U \mathbf{x}_k$ is sufficient. The weights are required to be
consistent with the computable population total of the
predictions $\hat{y}_k$, derived via an appropriate model
formulation. Thus the weight system may not be consistent
with the known population total of each auxiliary variable,
unless there is special provision to retain this property.
Model-calibration still satisfies all three parts, (a) to (c), of
the definition of calibration in Section 1.1; in particular, the
estimators are nearly design unbiased.

Consider a non-linear assisting model of the type (3.4).
We estimate the unknown parameter $\boldsymbol{\theta}$ by $\hat{\boldsymbol{\theta}}$, leading to
fitted values $\hat{y}_k = \hat{\mu}_k = \mu(\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ computed with the aid of
the $\mathbf{x}_k$ known for all $k \in U$. It follows that the population
size N is known and should be brought to play a significant
role in the calibration. If minimum chi-square distance is
used, we find the weights of the model-calibration
estimator $\hat{Y}_{MCAL} = \sum_s w_k y_k$ by minimizing
$\sum_s (w_k - d_k)^2/(2 d_k q_k)$, for specified $q_k$, and $d_k = 1/\pi_k$,
subject to the calibration equations

$$\sum_s w_k = N; \quad \sum_s w_k \hat{y}_k = \sum_U \hat{y}_k. \tag{4.3}$$

For simplicity, let us take $q_k = 1$ for all k; we derive the
calibrated weights, rearrange terms and find that the model-calibration
estimator can be written as

$$\hat{Y}_{MCAL} = N \{\bar{y}_{s;d} + (\bar{\hat{y}}_U - \bar{\hat{y}}_{s;d}) B_{s;d}\} \tag{4.4}$$

where $\bar{y}_{s;d} = \sum_s d_k y_k / \sum_s d_k$, $\bar{\hat{y}}_{s;d} = \sum_s d_k \hat{y}_k / \sum_s d_k$, $\bar{\hat{y}}_U = \sum_U \hat{y}_k / N$ and

$$B_{s;d} = \sum_s d_k (\hat{y}_k - \bar{\hat{y}}_{s;d})(y_k - \bar{y}_{s;d}) \Big/ \sum_s d_k (\hat{y}_k - \bar{\hat{y}}_{s;d})^2.$$

The regression implied by $B_{s;d}$ is one of observed y-values
on predicted y-values. The idea of this regression
would hardly occur to the modeler in his/her attempts to
structure the relation between $y_k$ and $\mathbf{x}_k$, but it proves
effective in building the calibration estimator. Wu and Sitter
(2001) present evidence that

$$(\hat{Y}_{MCAL} - Y)/N = \Big(\sum_s d_k E_k - \sum_U E_k\Big)/N + O_p(n^{-1})$$

with $E_k = y_k - \bar{y}_U - (\mu_k - \bar{\mu}_U) B_U$, where
$B_U = \sum_U (\mu_k - \bar{\mu}_U)(y_k - \bar{y}_U) / \sum_U (\mu_k - \bar{\mu}_U)^2$ and $\bar{\mu}_U = \sum_U \mu_k / N$.
The coefficient $B_U$ may not be near one even in large
samples. It expresses a regression of $y_k$ on its assisting
model mean $\mu_k = \mu(\mathbf{x}_k, \boldsymbol{\theta})$. That is, $\hat{Y}_{MCAL}$ can be viewed
as a regression estimator that uses the model expectation $\mu_k$
as the auxiliary variable, leaving $E_k$ as the residuals that
determine the asymptotic variance of $\hat{Y}_{MCAL}$.

How does this asymptotic variance compare with that of
the non-linear GREG construction (3.1) for the same non-linear
assisting model and the same $\hat{y}_k = \hat{\mu}_k$? Formula (3.1)
implies a slope equal to unity in the regression between $y_k$
and $\hat{y}_k = \hat{\mu}_k$; viewed in that light, $\hat{Y}_{GREG}$ is a difference
estimator rather than a regression estimator and hence less
sensitive to the pattern in the data. The non-linear GREG
$\hat{Y}_{GREG}$ is in general less efficient than $\hat{Y}_{MCAL}$. (It is of course
possible to modify $\hat{Y}_{GREG}$ to also account for the
information contained in the known population size N.)
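
The link between the calibration equations (4.3) and the closed form (4.4) can be checked numerically. In the sketch below the population is simulated, the sampling is simple random, and the fitted values are a stand-in: the coefficients used to produce `y_hat` are hypothetical and are not the result of an actual model fit. Only the algebraic identity is being illustrated.

```python
import numpy as np

# Simulated population (assumption for illustration only).
rng = np.random.default_rng(4)
N, n = 400, 80
x = rng.gamma(shape=1.0, scale=1.0, size=N)
y = np.exp(0.5 + 0.8 * x + rng.normal(0.0, 0.3, N))
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)

# Stand-in for the fitted values mu(x_k, theta_hat); hypothetical coefficients.
y_hat = np.exp(0.45 + 0.85 * x)

# Calibrate on (1, y_hat_k) with chi-square distance and q_k = 1, as in (4.3).
A = np.column_stack([np.ones(n), y_hat[s]])
targets = np.array([N, y_hat.sum()])
lam = np.linalg.solve((A * d[:, None]).T @ A, targets - A.T @ d)
w = d * (1.0 + A @ lam)
y_mcal_weights = np.sum(w * y[s])

# Closed form (4.4).
y_bar = np.sum(d * y[s]) / d.sum()
yh_bar_s = np.sum(d * y_hat[s]) / d.sum()
yh_bar_U = y_hat.sum() / N
B = (np.sum(d * (y_hat[s] - yh_bar_s) * (y[s] - y_bar))
     / np.sum(d * (y_hat[s] - yh_bar_s) ** 2))
y_mcal_closed = N * (y_bar + (yh_bar_U - yh_bar_s) * B)
```

In an actual application the fitted values would come from a model estimated on the sample y-values, so the weights would change with the study variable.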

On the other hand, compared with the linear (model-free)
calibration estimator $\hat{Y}_{CAL} = \sum_s w_k y_k$ with weights as in
(3.3), the model-calibration estimator $\hat{Y}_{MCAL}$ given by (4.4)
may have a considerable variance advantage but implies a
loss of the practical advantages of a consistency with the
known population total $\sum_U \mathbf{x}_k$ and a multi-purpose weight
system applicable to all y-variables. The y-values in (4.4) are
linearly weighted, but the weights now also depend on the
y-values. It is thus debatable if $\hat{Y}_{MCAL}$ is a bona fide
calibration estimator.

In an empirical study, Wu and Sitter (2001) compare
$\hat{Y}_{MCAL} = \sum_s w_k y_k$, calibrated according to (4.3), with the
non-linear GREG, $\hat{Y}_{GREG} = \sum_U \hat{y}_k + \sum_s d_k (y_k - \hat{y}_k)$ given by
(3.1), for the same non-linear assisting model and same
$\hat{y}_k = \hat{\mu}_k$. The study confirms that $\hat{Y}_{MCAL}$ has a variance
advantage over the non-linear $\hat{Y}_{GREG}$. They created a finite
population U of size N = 2,000 with values
$(y_k, x_k)$, $k = 1, \ldots, 2{,}000$, such that $\log(y_k) = 1 + x_k + \varepsilon_k$;
the 2,000 values $x_k$ are realizations of the Gamma(1,1)
random variable, and $\varepsilon_k$ is a normally distributed error. The
auxiliary information consists of the population size N and
the known values $x_k$ for $k = 1, \ldots, 2{,}000$. Repeated simple
random samples of size n = 100 were taken; the assisting
model for both estimators was the log-linear
$E_\xi(y_k \mid x_k) = \mu_k$ with $\log(\mu_k) = \alpha + \beta x_k$. This model was
fit for each sample, using pseudo-maximum quasi-likelihood
estimation. The fitted values $\hat{y}_k = \exp(\hat{\alpha} + \hat{\beta} x_k)$
were used to form both $\hat{Y}_{MCAL}$ and $\hat{Y}_{GREG}$. The simulation
variance was markedly lower for $\hat{Y}_{MCAL}$. (The linear GREG
(3.2), identical to the model-free calibration estimator, was
also included in the Wu and Sitter study; not surprisingly, it
is even less efficient than the non-linear GREG, under the
strongly non-linear relationship imposed in their
experiment.)

Montanari and Ranalli (2005) provide further evidence,
for several artificially created populations, on the
comparison between $\hat{Y}_{MCAL}$ and the non-linear $\hat{Y}_{GREG}$.
Their assisting model, $y_k = \mu_k + \varepsilon_k$, is fitted via
nonparametric regression (local polynomial smoothing),
yielding predictions $\hat{y}_k = \hat{\mu}_k$ for $k \in U$. With this type of
model fit, the predictions $\hat{y}_k = \hat{\mu}_k$ are highly accurate. Not
surprisingly, the model-calibration estimator $\hat{Y}_{MCAL}$
achieves only marginal improvement over the non-linear
$\hat{Y}_{GREG}$.

We can summarize the calibration approach as follows:
The estimator of $Y = \sum_U y_k$ has the linearly weighted form
$\hat{Y} = \sum_s w_k y_k$. In linear (model-free) calibration, the
calibration equation reads $\sum_s w_k \mathbf{x}_k = \sum_U \mathbf{x}_k$; a known
population auxiliary total $\sum_U \mathbf{x}_k$ is required, but complete
auxiliary information (known $\mathbf{x}_k$ for all $k \in U$) is not
required; the same weights can be applied to all y-variables
(multi-purpose weighting); the estimator is identical to the
linear GREG estimator (but derived by different reasoning).
In model-calibration, the assisting model mean $\mu_k$ is non-linear
in $\mathbf{x}_k$; complete auxiliary information is usually
required; the calibration constraints include the equation
$\sum_s w_k \hat{y}_k = \sum_U \hat{y}_k$; the weights $w_k$ depend on the $y_k$-values,
implying a loss of the multi-purpose property.


5. Computational aspects, extreme weights 
and outliers 


The computation of calibrated weights raises important 
practical issues, discussed in a number of papers. All 
computation must proceed smoothly and routinely in the 
large scale statistics production of a national statistical 
agency. Undesirable (or unduly variable) weights should be 
avoided. Many practitioners support the reasonable 
requirement that all weights be positive (even greater than 
unity) and that very large weights should be avoided. 

A few of the weights computed according to (3.3) can
turn out to be quite large or negative. Huang and Fuller 
(1978) and Park and Fuller (2005) proposed methods to 
avoid undesirable weights. 

In the distance minimization method, the distance 
function can be formulated so that negative weights are 
excluded, while still satisfying the given calibration 
equations. The software CALMAR (Deville, Särndal and
Sautory 1993) allows several distance functions of this kind.
An extended version, CALMAR2, is described in
LeGuennec and Sautory (2002). Other statistical agencies
have developed their own software for weight computation. 
Among those are GES (Statistics Canada), CLAN97 
(Statistics Sweden), Bascula 4.0 (Central Bureau of 
Statistics, The Netherlands), g-CALIB-S (Statistics 
Belgium). These strive, in different ways, to resolve the 
computational issues arising. The user needs to consult the 
users’ guide in each particular case to see exactly how the 
computational issues, including an avoidance of undesirable 
weights, are handled. 

GES uses mathematical programming to minimize the 
chi-square distance, subject to the calibration constraints as 
well as to individual bounds on the weights, so that they will 
satisfy $A_k \le w_k \le B_k$ for specified $A_k$, $B_k$. Bascula 4.0 is 
described in Nieuwenbroek and Boonstra (2002). The 
software g-CALIB-S, described in Vanderhoeft, Waeytens 
and Museux (2001) and Vanderhoeft (2001), uses the generalized 
(Moore-Penrose) inverse for the weight computation; 
consequently one need not be concerned about a possible 
redundancy in the auxiliary information. 
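
The role of the generalized inverse can be illustrated with a minimal sketch. It assumes linear calibration weights of the form $w_k = d_k(1 + x_k'\lambda)$ and uses NumPy's `numpy.linalg.pinv`; the data are hypothetical and the code does not reproduce how g-CALIB-S is implemented.

```python
import numpy as np

# Hypothetical sample: an intercept plus two group indicators that add up to
# the intercept, so the auxiliary vector x_k = (1, a_k, 1 - a_k) is redundant.
a = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
x = np.column_stack([np.ones_like(a), a, 1.0 - a])
d = np.full(5, 20.0)                       # design weights (N = 100, n = 5)
X_total = np.array([100.0, 60.0, 40.0])    # known totals: N, N_A, N_B

T = x.T @ (d[:, None] * x)                 # sum_s d_k x_k x_k', singular here
lam = np.linalg.pinv(T, rcond=1e-10) @ (X_total - d @ x)
w = d * (1.0 + x @ lam)                    # linear calibration weights

print(np.linalg.matrix_rank(T))            # rank is below the dimension 3
print(w)
print(w @ x)                               # calibrated totals
```

Because the known totals obey the same linear dependency as the sample vectors, the Moore-Penrose solution still satisfies every calibration equation, and no x-variable has to be dropped by hand.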

In Bankier, Houle and Luc (1997) the objective is two- 
fold: to keep the computed weights within desirable bounds, 
and to drop some x-variables to remove near-linear 
dependencies. Isaki, Tsay and Fuller (2004) consider 
quadratic programming to obtain both household weights 
and person weights that lie within specified bounds. 

An intervention with the weights (so as to get rid of 
undesirable weight values) raises the question of how far one 
can deviate from the design weights $d_k$ without 
compromising the desirable feature of nearly design 
unbiased estimation. An idea that has been tried is to modify 
the set of constraints so that tolerances are respected for the 
difference between the estimator for the auxiliary variables 
and the corresponding known population totals. Hence, 
Chambers (1996) minimizes a “cost-ridged loss function”. 

Outlying values in the auxiliary variables may be a cause 
of extreme weights. Calibration in the presence of outliers is 
discussed in Duchesne (1999). His technique of “robust 
calibration” may introduce a certain bias in the estimates; it 
may, however, be more than offset by a reduction in 
variance. 

When the set of constraints is extended to make the 
weights restricted to specified intervals, a solution to the 
optimization problem is not guaranteed. The existence of a 
solution is considered in Théberge (2000), who also 
proposes methods for dealing with outliers. 


6. Calibration estimation for more 
complex parameters 


The calibration approach adapts itself to the estimation of 
more complex parameters than a population total. Examples 
are reviewed in this section. Single phase sampling and full 
response continue to be assumed; the notation remains as in 
Section 2. One example is the estimation of population 
quantiles (Section 6.1), another is the estimation of 
functions of totals (Section 6.2). Other examples in this 
category, not reviewed here, are Théberge (1999), for the 
estimation of bilinear parameters, and Tracy, Singh and 
Arnab (2003), for calibration with respect to second order 
moments. 


6.1 Calibration for estimation of quantiles 


The median and other quantiles of the finite population 
are important descriptive measures, especially in economic 
surveys. To estimate quantiles, the finite population 
distribution function must first be estimated. Before 
calibration became popular, several papers considered the 
estimation of quantiles, with or without the use of auxiliary 
information. More recent articles have turned to the 
calibration approach for the same purpose, including 
Kovacevic (1997), Wu and Sitter (2001), Ren (2002), Tillé 
(2002), Harms (2003), Harms and Duchesne (2006) and 
Rueda et al. (2007). As these papers illustrate, there is more 
than one way to implement the calibration approach. The 
non-smooth character of the finite population distribution 
function causes certain complexities; these are resolved by 
different authors in different ways. 

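Let $\Delta(\cdot)$ denote the Heaviside function, defined for all 
real z so that $\Delta(z) = 1$ if $z \ge 0$ and $\Delta(z) = 0$ if $z < 0$. The 
unknown distribution function of the study variable y is 

$$F_y(t) = \frac{1}{N} \sum_U \Delta(t - y_k) \qquad (6.1)$$

The $\alpha$-quantile of the finite population is defined as 
$Q_{y\alpha} = \inf\{t \mid F_y(t) \ge \alpha\}$. The auxiliary variable $x_j$, taking 
values $x_{jk}$, has the distribution function 
$F_{x_j}(t) = (1/N) \sum_U \Delta(t - x_{jk})$ with $\alpha$-quantile denoted $Q_{x_j\alpha}$, 
$j = 1, 2, \ldots, J$. A natural estimator of $F_y(t)$ based on the 
design weights $d_k$ is 

$$\hat{F}_y(t) = \frac{\sum_s d_k \Delta(t - y_k)}{\sum_s d_k}$$

A calibration estimator of $F_y(t)$ takes the form 

$$\hat{F}_{y\mathrm{CAL}}(t) = \frac{\sum_s w_k \Delta(t - y_k)}{\sum_s w_k} \qquad (6.2)$$

where the weights $w_k$ are suitably calibrated to specified 
auxiliary information; then from $\hat{F}_{y\mathrm{CAL}}(t)$ we obtain the 
$\alpha$-quantile estimator as $\hat{Q}_{y\mathrm{CAL}\alpha} = \inf\{t \mid \hat{F}_{y\mathrm{CAL}}(t) \ge \alpha\}$. A 
formula analogous to (6.2) holds for $\hat{F}_{x_j\mathrm{CAL}}(t)$. 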
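
As a minimal sketch of the weighted distribution function and the quantile obtained from it as the smallest t at which that function reaches $\alpha$, the following code works for any set of positive weights (design weights or calibrated weights). The function names and the data are illustrative assumptions.

```python
def weighted_cdf(t, y, w):
    """Estimated distribution function: sum_s w_k * 1{y_k <= t} / sum_s w_k."""
    total = sum(w)
    return sum(wk for yk, wk in zip(y, w) if yk <= t) / total

def weighted_quantile(alpha, y, w):
    """Smallest observed t whose estimated distribution function is >= alpha."""
    total = sum(w)
    cumulative = 0.0
    for yk, wk in sorted(zip(y, w)):       # walk through the y-values in order
        cumulative += wk
        if cumulative >= alpha * total:
            return yk
    return max(y)                          # guard against rounding near alpha = 1

y = [40.0, 10.0, 30.0, 20.0]   # hypothetical study variable values
w = [4.0, 1.0, 3.0, 2.0]       # hypothetical positive weights

median = weighted_quantile(0.5, y, w)
lower_quartile = weighted_quantile(0.25, y, w)
print(median, lower_quartile)
```

The sketch presupposes positive weights; with negative weights the estimated distribution function would no longer be monotone and the infimum would need more care.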

Without explicit reference to any model, Harms and 
Duchesne (2006) specify the information available for 
calibration as a known population size, N, and known 
population quantiles $Q_{x_j\alpha}$ for $j = 1, 2, \ldots, J$. The complete 
auxiliary information, with values $x_k = (x_{1k}, \ldots, x_{Jk})'$ 
known for $k \in U$, is not required. (But in practice, the 
complete information would usually be necessary, because 
accurate quantiles of several x-variables are not likely to be 
importable from outside sources.) They determine the $w_k$ to 
minimize the chi-square distance $\sum_s (w_k - d_k)^2 / (2 d_k q_k)$, for 
specified $q_k$, subject to the calibration equations 

$$\sum_s w_k = N; \qquad \hat{Q}_{x_j\mathrm{CAL}\alpha} = Q_{x_j\alpha}, \quad j = 1, \ldots, J$$

for suitably defined estimates $\hat{Q}_{x_j\mathrm{CAL}\alpha}$. Now, if we were to 
specify $\hat{Q}_{x_j\mathrm{CAL}\alpha} = \inf\{t \mid \hat{F}_{x_j\mathrm{CAL}}(t) \ge \alpha\}$, then it is in general 
not possible to find an exact solution of the calibration 
problem as stated. Instead, Harms and Duchesne substitute 
smoothed estimators, called “interpolated distribution 
estimators”, of the distribution functions 
$F_{x_j}(t)$, $j = 1, 2, \ldots, J$. They replace $\Delta(\cdot)$ by a slightly 
modified function. Weights $w_k$ can now be obtained, as 
well as a corresponding estimated distribution function 
$\hat{F}_{y\mathrm{CAL}}(t)$; finally, $Q_{y\alpha}$ is estimated as $\hat{Q}_{y\mathrm{CAL}\alpha} = \hat{F}_{y\mathrm{CAL}}^{-1}(\alpha)$. 

The resulting calibrated weights $w_k$ allow us to retrieve 
the known population quantiles of the auxiliary variables. 
This is reassuring; one would expect such weights to 
produce reasonable estimators for the quantiles of the study 
variable y. Moreover, in the case of a single scalar auxiliary 
variable x, the resulting calibration estimator delivers exact 
population quantiles for y when the relationship between y 
and x is exactly linear, that is, when $y_k = \beta x_k$ for all $k \in U$. 
An idea involving smoothed distribution functions is also 
used in Tillé (2002). 

The computationally simpler method of Rueda et al. 
(2007) is an application of model-calibration, in that they 
calibrate with respect to a population total of predicted 
y-values. Complete auxiliary information is required. Using 
the known $x_k$, compute first the linear predictions 
$\hat{y}_k = \hat{\beta}' x_k$ for $k \in U$, with $\hat{\beta} = (\sum_s d_k q_k x_k x_k')^{-1} 
(\sum_s d_k q_k x_k y_k)$, where $d_k = 1/\pi_k$ and the $q_k$ are specified 
scale factors. The weights $w_k$ are obtained by minimizing 
the chi-square distance subject to calibration equations 
stated in terms of the predictions, so as to have consistency 
at J arbitrarily chosen points $t_j$, $j = 1, \ldots, J$: 

$$\frac{1}{N} \sum_s w_k \Delta(t_j - \hat{y}_k) = F_{\hat{y}}(t_j), \quad j = 1, \ldots, J$$

where $F_{\hat{y}}(t_j)$ is the finite population distribution function 
of the predictions $\hat{y}_k$, evaluated at $t_j$. It is suggested that a 
fairly small number of arbitrarily selected points $t_j$ may 
suffice, say fewer than 10. Once the $w_k$ are determined, the 
$\alpha$-quantile estimate is obtained from 
$\hat{F}_{y\mathrm{CAL}}(t) = (1/N) \sum_s w_k \Delta(t - y_k)$. 

Quantile estimation provides a good illustration that the 
calibration approach can be carried out in more than one 
way when somewhat more complex parameters are being 
estimated. Both methods mentioned give nearly design 
unbiased estimation. The Harms and Duchesne (2006) 
weights are multi-purpose, independent of the y-variable; by 
contrast, the method of Rueda et al. (2007) requires a new 
set of weights for every new y-variable. Empirical evidence, 
by simulation, suggests that both methods compare 
favourably with the earlier quantile estimation methods, not 
based explicitly on calibration thinking (but on the same 
auxiliary information). 

An extension of the calibration approach to the 
estimation of other complex parameters, such as the Gini 
coefficient, is sketched in Harms and Duchesne (2006). 


6.2 Calibration for other complex parameters 


Plikusas (2006), and Krapavickaitė and Plikusas (2005) 
examine calibration estimation of certain functions of 
population totals. (Their term “non-linear calibration” 
signifies “non-linear function of totals”; I do not use it here.) 
A simple example is the estimation of a ratio of two totals, 
$R = \sum_U y_{1k} / \sum_U y_{2k}$, where $y_{1k}$ and $y_{2k}$ are the values for 
element k of the variables $y_1$ and $y_2$, respectively. (The 
distribution function (6.1) is in effect also of ratio type, with 
$y_{2k} = 1$, and $N = \sum_U 1$ as the denominator total.) These 
authors examine the calibration estimator 
$\hat{R}_{\mathrm{CAL}} = \sum_s w_k y_{1k} / \sum_s w_k y_{2k}$. Its weights $w_k$, common to 
the numerator and the denominator, are determined by 
calibration to auxiliary information stated as follows: There 
is one auxiliary variable, $x_{1k}$, for $y_{1k}$, and another, $x_{2k}$, for 
$y_{2k}$; the ratio of totals $R_0 = \sum_U x_{1k} / \sum_U x_{2k}$ is a known 
value, by a complete enumeration at a previous occasion or 
from some other accurate source. The proposed calibration 
equation is $\sum_s w_k e_k = 0$, where $e_k = x_{1k} - R_0 x_{2k}$. Because 
$\sum_U e_k = 0$, the weights, by minimum chi-square distance, 
are 

$$w_k = d_k \left\{ 1 - \left( \sum_s d_k e_k \right) \left( \sum_s d_k e_k^2 \right)^{-1} e_k \right\}$$

These weights correctly retrieve the known ratio value 
$R_0$; setting $y_{1k} = x_{1k}$ and $y_{2k} = x_{2k}$ in $\hat{R}_{\mathrm{CAL}}$, we have 

$$\frac{\sum_s w_k x_{1k}}{\sum_s w_k x_{2k}} - R_0 = \frac{\sum_s w_k (x_{1k} - R_0 x_{2k})}{\sum_s w_k x_{2k}} = \frac{\sum_s w_k e_k}{\sum_s w_k x_{2k}} = 0.$$

The empirical evidence in Plikusas (2006), and 
Krapavickaitė and Plikusas (2005) suggests that their 
calibration estimator compares favourably (lower variance, 
while maintaining near design unbiasedness) with other 
estimators, derived through other arguments than 
calibration, while relying on the same auxiliary information. 
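
The retrieval property of the ratio-calibrated weights can be checked numerically. The sketch below implements the minimum chi-square weights for the single calibration equation $\sum_s w_k e_k = 0$ with $e_k = x_{1k} - R_0 x_{2k}$; the function name and the data values are hypothetical.

```python
def ratio_calibration_weights(d, x1, x2, R0):
    """Chi-square calibration to the single equation sum_s w_k e_k = 0."""
    e = [a - R0 * b for a, b in zip(x1, x2)]             # e_k = x_1k - R0 * x_2k
    sum_de = sum(dk * ek for dk, ek in zip(d, e))        # sum_s d_k e_k
    sum_de2 = sum(dk * ek * ek for dk, ek in zip(d, e))  # sum_s d_k e_k^2
    return [dk * (1.0 - sum_de / sum_de2 * ek) for dk, ek in zip(d, e)]

d = [5.0, 5.0, 5.0, 5.0]        # hypothetical design weights
x1 = [12.0, 7.0, 20.0, 9.0]     # hypothetical auxiliary variable for y_1
x2 = [10.0, 8.0, 15.0, 10.0]    # hypothetical auxiliary variable for y_2
R0 = 1.1                        # hypothetical known ratio of auxiliary totals

w = ratio_calibration_weights(d, x1, x2, R0)
ratio_x = (sum(wk * a for wk, a in zip(w, x1))
           / sum(wk * b for wk, b in zip(w, x2)))
print(ratio_x)                  # agrees with R0 up to rounding error
```

The same weights would then be applied to the study variables to form the calibrated ratio estimate.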


7. Calibration contrasted with other approaches 


As many have noted, users view calibration as a simple 
and convincing way to incorporate auxiliary information, 
for simple parameters (Section 4), as for more complex 
parameters such as quantiles, ratios and others (Section 6). 
Simplicity and practicality are undeniable advantages, but 
aside from that, is calibration also “theoretically superior”? 
Are there instances where calibration can be shown to give 
more accurate and/or more satisfactory answers on 
questions of importance, when contrasted with other 
design-based approaches? 

Section 4.5 gave one indication that calibration thinking 
may have an advantage over GREG thinking, in that model- 
calibration may give more precise estimates than the 
non-linear GREG, for the same assisting model. The 
following Section 7.1 gives another example where 
calibration reasoning and GREG reasoning give diverging 
answers, with an advantage for the calibration method. 


7.1 An example in domain estimation 


The example in this section, from Estevao and Särndal 
(2004), shows, for a simple practical situation, a conflict 
between the results of GREG thinking and calibration 
thinking. The context is the estimation of the y-total for a 
sub-population (a domain). 

A probability sample s is drawn from 
$U = \{1, \ldots, k, \ldots, N\}$; the known design weights are 
$d_k = 1/\pi_k$. Let $U_a$ be a domain; $U_a \subset U$. The domain 
indicator is $\delta_{ak}$, with value $\delta_{ak} = 1$ if $k \in U_a$ and $\delta_{ak} = 0$ 
if not. The target of estimation is the domain total 
$Y_a = \sum_U y_{ak}$, where $y_{ak} = \delta_{ak} y_k$ and $y_k$ is observed for 
$k \in s$. The Horvitz-Thompson estimator $\hat{Y}_a = \sum_s d_k y_{ak}$, 
although design unbiased, has low precision, especially if 
the domain is small; the use of auxiliary information will 
bring improvement. An auxiliary vector value $x_k$ is 
specified for every $k \in U$. 

As is frequently the case in practice, the elements 
belonging to a domain of interest are not identified in the 
sampling frame. (If they are, some very powerful 
information is available from the start, but frequently real 
world conditions are not that favourable.) But suppose 
elements in a larger group $U_C$ are identifiable; 
$U_a \subset U_C \subset U$. For example, suppose y is “income” and 
$U_C$ a professional group specified for the persons listed in 
the frame, while $U_a$ is a professional sub-group not 
identified in the frame. We can identify the sample subsets 
$s_C = s \cap U_C$ and $s_a = s \cap U_a$, and we can benefit from 
knowing the total $\sum_U x_{Ck}$, estimable without bias by 
$\sum_s d_k x_{Ck}$, where $x_{Ck} = \delta_{Ck} x_k$, and $\delta_{Ck}$ is the information 
group indicator: $\delta_{Ck} = 1$ if $k \in U_C$ and $\delta_{Ck} = 0$ if not. The 
domain auxiliary total $\sum_U x_{ak}$ is unavailable, because $U_a$ is 
not identified. Calibration to satisfy $\sum_s w_k x_{Ck} = \sum_U x_{Ck}$ 
gives the nearly design unbiased estimator 
$\hat{Y}_{a\mathrm{CAL}} = \sum_s w_k y_{ak}$, where $w_k = d_k (1 + \lambda' z_k)$ with 
$\lambda' = (\sum_U x_{Ck} - \sum_s d_k x_{Ck})' (\sum_s d_k z_k x_{Ck}')^{-1}$. The 
asymptotically optimal instrument for the given vector $x_{Ck}$ is 
(see Section 4.3) $z_k = z_{0k} = d_k^{-1} \sum_{\ell \in s} (d_k d_\ell - d_{k\ell}) x_{C\ell}$. 

By contrast, regression thinking for the same auxiliary 
information leads to $\hat{Y}_{a\mathrm{GREG}} = \sum_s d_k y_{ak} + (\sum_U x_{Ck} - 
\sum_s d_k x_{Ck})' B_{\tilde{s}}$, also nearly design unbiased, where the 
regression coefficient $B_{\tilde{s}} = (\sum_{\tilde{s}} d_k x_k x_k')^{-1} (\sum_{\tilde{s}} d_k x_k y_k)$ is the 
result of a weighted least squares fit at a suitable level, using 
all (when $\tilde{s} = s$) or part (when $\tilde{s} \subset s$) of the data points 
$(y_k, x_k)$ available for $k \in s$. 

For example, the modeller may opt for a regression fit 
“extending beyond the domain” (so that $\tilde{s} \supset s_a = s \cap U_a$), 
in an attempt to borrow strength for $\hat{Y}_{a\mathrm{GREG}}$ by letting it 
depend also on y-data from outside the domain. By contrast, 
$\hat{Y}_{a\mathrm{CAL}}$ relies exclusively on y-data in the domain, and this is 
in effect better. Estevao and Särndal (2004) show that 
$\hat{Y}_{a\mathrm{CAL}}$ with $z_k = z_{0k}$ has smaller (asymptotic) variance 
than $\hat{Y}_{a\mathrm{GREG}}$, no matter how $\tilde{s}$ is chosen. Bringing in y-data 
from the outside does not help; calibration thinking and 
regression thinking do not agree. 
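
To make the calibration side of this comparison concrete, here is a deliberately simplified sketch. It assumes the scalar auxiliary value $x_k = 1$, so that the known total is just the size of the larger group, and it uses the standard instrument $z_k = x_{Ck}$ rather than the asymptotically optimal one; under these assumptions the weights inside the group are simply rescaled to reproduce the known group size. Names and data are hypothetical.

```python
def domain_calibration_estimate(d, y, in_group, in_domain, group_size):
    """Calibrate to a known group size, then sum weighted y over the domain."""
    n_hat = sum(dk for dk, g in zip(d, in_group) if g)   # estimated group size
    factor = group_size / n_hat                          # equals 1 + lambda
    w = [dk * factor if g else dk for dk, g in zip(d, in_group)]
    y_hat = sum(wk * yk for wk, yk, a in zip(w, y, in_domain) if a)
    return w, y_hat

d = [10.0] * 6                                           # hypothetical design weights
y = [5.0, 7.0, 6.0, 8.0, 4.0, 9.0]                       # hypothetical study values
in_group = [True, True, True, True, False, False]        # identifiable larger group
in_domain = [True, True, False, False, False, False]     # domain, inside the group
group_size = 50.0                                        # hypothetical known total

w, y_a_cal = domain_calibration_estimate(d, y, in_group, in_domain, group_size)
group_check = sum(wk for wk, g in zip(w, in_group) if g)
print(y_a_cal, group_check)
```

Only y-values from the domain enter the estimate; the information about the larger group acts solely through the weights.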


8. Calibration estimation in the presence 
of composite information 


As the preceding sections have shown, many papers 
choose to study estimation for direct, single phase sampling 
of elements, without any nonresponse. The information 
available for calibration is simple; the k:th element of the 
finite population $U = \{1, \ldots, k, \ldots, N\}$ has an associated 
auxiliary vector value $x_k$. 

However, in an important category of situations, the 
auxiliary information has composite structure. The 
complexity of the information increases with that of the 
sampling design. In designs with two or more phases, or in 
two or more stages, the information is typically composed 
of more than one component, reflecting the features of the 
design. The information is stated in terms of more than one 
auxiliary vector. For example, in two-stage sampling, some 
information may be available about the first stage sampling 
units (the clusters), other information about the second stage 
units (the elements). 

Consequently, estimation by calibration (or by any 
alternative method) must take the composite structure of the 
information systematically into account. The total 
information has several pieces; the calibration can be done in 
more than one way. All relevant pieces should be taken into 
account, for best possible accuracy in the estimates. To 
accomplish this in a general or “optimal” way is not a trivial 
task. Calibration reasoning offers one way. 

Regression reasoning, with a duly formulated assisting 
model, is an alternative way, but it will strike some users as 
more roundabout. Hence, surveys that allow composite 
auxiliary information bring further perspectives on the 
contrast between calibration thinking and GREG thinking. 

Two-phase sampling and two-stage sampling are 
discussed in this section. Another example of composite 
information occurs for nonresponse bias adjustment, as 
discussed in Section 9. 

Another aspect of composite information occurs when 
the objective is to combine information from several 
surveys. This, too, can be a way to add strength and improve 
accuracy of the estimates. It is a motivating factor (in 
addition to the user oriented motive to achieve consistency 
among surveys) in the previously mentioned repeated 
weighting methodology of the Dutch statistical agency. 
Combined auxiliary information for GREG estimation is 
considered in Merkouris (2004). 


8.1 Composite information for two-phase sampling 
designs 

Double sampling refers to designs involving two 
probability samples, $s_1$ and $s_2$, from the same population 
$U = \{1, \ldots, k, \ldots, N\}$. Auxiliary data may be recorded for 
both U and $s_1$; the study variable values $y_k$ are recorded 
only for $k \in s_2$, with the objective of estimating $Y = \sum_U y_k$. 
Hidiroglou (2001) distinguishes several kinds of double 
sampling: In the nested case (traditional two phase 
sampling), the first phase sample $s_1$ is drawn from U, the 
second phase sample $s_2$ is a sub-sample from $s_1$, so that 
$U \supset s_1 \supset s_2$. Two non-nested cases can be distinguished: 
In the first of these, $s_1$ is drawn from the frame $U_1$; $s_2$ 
from the frame $U_2$, where $U_1$ and $U_2$ cover the same 
population U; the sampling units may be defined 
differently for the two frames. In the second non-nested 
case, $s_1$ and $s_2$ are drawn independently from U. 

To illustrate how composite information intervenes in the 
estimation, consider the nested case. The design weights are 
$d_{1k} = 1/\pi_{1k}$ (for $s_1$ sampled from U) and $d_{2k} = 1/\pi_{k|s_1}$ 
(in sub-sampling $s_2$ from $s_1$). The combined design weight 
is $d_k = d_{1k} d_{2k}$. The basic unbiased estimator $\hat{Y} = \sum_{s_2} d_k y_k$ 
can be improved by a use of auxiliary information, specified 
here at two levels: 


Population level: The vector value $x_{1k}$ is known (given in 
the frame) for every $k \in U$, thus known for every $k \in s_1$ 
and for every $k \in s_2$; $\sum_U x_{1k}$ is a known population vector 
total; 


First sample level: The vector value $x_{2k}$ is known 
(observed) for every $k \in s_1$, and thereby known for every 
$k \in s_2$; the unknown total $\sum_U x_{2k}$ is estimated without 
bias by $\sum_{s_1} d_{1k} x_{2k}$. 


How do we best take this composite information into 
account? In an adaptation of GREG thinking, Särndal and 
Swensson (1987) formulated two linear assisting models, 
the first one stated in terms of the $x_{1k}$-vector, the other one 
also bringing in the $x_{2k}$-vector. The two models are fitted; the 
resulting predictions, of two kinds, are used to create an 
appropriate GREG estimator $\hat{Y}_{\mathrm{GREG}}$ of $Y = \sum_U y_k$. 

Dupont (1995) makes the important point that the given 
composite information invites “two different natural 
approaches”: Besides the GREG approach, there is a 
calibration approach that will deliver final weights $w_k$ for a 
calibration estimator $\hat{Y}_{\mathrm{CAL}} = \sum_{s_2} w_k y_k$. It is of interest to 
compare the results of the two approaches. Both of them 
allow more than one option: In the GREG approach, there 
are alternative ways of formulating the linear assisting 
models with their respective variance structures. In the 
calibration approach, alternative formulations of the 
calibration equations are possible. 

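For example, a two-step calibration option is as follows: 
First find intermediate weights $w_{1k}$ to satisfy 
$\sum_{s_1} w_{1k} x_{1k} = \sum_U x_{1k}$; then use the weights $w_{1k}$ in the second 
step to compute the final weights $w_k$ to satisfy 

$$\sum_{s_2} w_k x_k = \begin{pmatrix} \sum_U x_{1k} \\ \sum_{s_1} w_{1k} x_{2k} \end{pmatrix}$$

where $x_k = (x_{1k}', x_{2k}')'$ is the combined auxiliary vector. 

Alternatively, in a single step option, we determine the 
$w_k$ directly to satisfy 

$$\sum_{s_2} w_k x_k = \begin{pmatrix} \sum_U x_{1k} \\ \sum_{s_1} d_{1k} x_{2k} \end{pmatrix}$$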

The final weights $w_k$ are in general not identical in the 
two options. Suppose that $\sum_U x_{1k}$ is an imported $x_1$-total. 
On closer inspection, the two-step option requires more extensive 
information, because individually known values $x_{1k}$ are 
required for $k \in s_1$, whereas it is sufficient in the single step 
option that they be available for $k \in s_2$. Some variance 
advantage may thus be expected from the two-step option, 
since $\sum_{s_1} w_{1k} x_{2k}$ is often more accurate (as an estimator of 
$\sum_U x_{2k}$) than $\sum_{s_1} d_{1k} x_{2k}$ in the single step procedure. 
Nevertheless, this anticipation is not always confirmed; the 
single step method can be better, as when $x_1$ and $x_2$ are 
weakly correlated. 
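
The two options can be compared side by side in a short sketch. It assumes linear (chi-square distance) calibration at every step, with weights of the form $d_k(1 + x_k'\lambda)$; the helper `linear_calibrate` and all data are hypothetical and serve only to show which totals each option calibrates to.

```python
import numpy as np

def linear_calibrate(d, x, totals):
    """Chi-square (linear) calibration: w_k = d_k (1 + x_k' lam)."""
    T = x.T @ (d[:, None] * x)                   # sum d_k x_k x_k'
    lam = np.linalg.solve(T, totals - d @ x)
    return d * (1.0 + x @ lam)

# Hypothetical first-phase sample s1: x_1k = (1, a_k) and scalar x_2k.
x1_s1 = np.array([[1, 2], [1, 4], [1, 5], [1, 7], [1, 8], [1, 10]], dtype=float)
x2_s1 = np.array([3, 5, 4, 8, 9, 12], dtype=float)
d1 = np.full(6, 10.0)                            # first-phase design weights
X1_total = np.array([60.0, 380.0])               # known population total of x_1k

in_s2 = np.array([0, 2, 3, 5])                   # second-phase sample within s1
d = d1[in_s2] * np.full(4, 1.5)                  # combined weights d_1k * d_2k
x_s2 = np.column_stack([x1_s1[in_s2], x2_s1[in_s2]])   # combined vector on s2

# Two-step option: intermediate weights on s1, then final weights on s2.
w1 = linear_calibrate(d1, x1_s1, X1_total)
target_two_step = np.append(X1_total, w1 @ x2_s1)
w_two_step = linear_calibrate(d, x_s2, target_two_step)

# Single step option: the x_2-total is estimated with the design weights d_1k.
target_single = np.append(X1_total, d1 @ x2_s1)
w_single = linear_calibrate(d, x_s2, target_single)

print(w_two_step @ x_s2, target_two_step)
print(w_single @ x_s2, target_single)
```

The two sets of final weights differ only through the third calibration target, the estimated total of the second auxiliary variable.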

Dupont (1995) and Hidiroglou and Särndal (1998) 
examine links that exist, not surprisingly, between the two 
approaches. A GREG estimator, derived from assisting 
models with specific variance structures, may be identical to 
a calibration estimator, if the weights of the latter are 
calibrated in a certain way. In other cases, differences may 
be small. 

The efficiency of different options depends in rather 
subtle ways on the pattern of correlation among $y_k$, $x_{1k}$ and 
$x_{2k}$. For example, to what extent do $x_{1k}$ and $x_{2k}$ 
complement each other, to what extent are they substitutes 
for one another? In the GREG approach, it is difficult or 
even futile to pinpoint a variance structure that truly 
captures a “reality” behind the data. The calibration 
approach is more direct. Some of its possibilities are 
explored in Estevao and Särndal (2002, 2006). 


8.2 Composite information in two-stage sampling 
designs 


The traditional two-stage sampling set-up (clusters 
sampled at stage one, elements sub-sampled within selected 
clusters in stage two) has in common with two-phase 
sampling that the total information may have more than one 
component. There may exist (a) information at the cluster 
level (about the clusters); (b) information at the element 
level for all clusters; (c) information at the element level for 
the selected clusters only. Here again, authors are of two 
different orientations: some exploit the information via 
calibration thinking, others follow the GREG thinking route. 

Estevao and Särndal (2006) develop calibration 
estimation for the traditional two-stage set-up, with 
composite information specified as follows: (i) for the 
cluster population $U_I$, there is a known total $\sum_{U_I} x_{(c)i}$, 
where $x_{(c)i}$ is a vector value associated with the cluster $U_i$, 
for $i \in U_I$; (ii) for the population of elements 
$U = \cup_{i \in U_I} U_i$, there is a known total $\sum_U x_k$, where the vector 
value $x_k$ is associated with the element $k \in U$. Suppose 
both cluster statistics and element statistics are to be 
produced in the survey: Both the cluster population total 
$Y_I = \sum_{U_I} y_{(c)i}$ and the element population total $Y = \sum_U y_k$ 
are to be estimated. 

If no relation is imposed between cluster weights $w_{Ii}$ and 
element weights $w_k$, the former are calibrated to satisfy 
$\sum_{s_I} w_{Ii} x_{(c)i} = \sum_{U_I} x_{(c)i}$, the latter to satisfy 
$\sum_s w_k x_k = \sum_U x_k$. (Here, $s_I$ is the sample of clusters from 
$U_I$; $s_i$ is the sample of elements from the cluster $U_i$; 
$s = \cup_{i \in s_I} s_i$ is the entire sample of elements.) Then 
$\hat{Y}_{I\mathrm{CAL}} = \sum_{s_I} w_{Ii} y_{(c)i}$ estimates the cluster population total 
$Y_I$ and $\hat{Y}_{\mathrm{CAL}} = \sum_s w_k y_k$ estimates the element population 
total Y. 

Integrated weighting is often used in practice: A 
convenient relationship is imposed between the cluster 
weight $w_{Ii}$ and the weights $w_k$ for the elements within the 
selected cluster. Two forms of integrated weighting are 
discussed in Estevao and Särndal (2006). 

One of these is to impose $w_k = d_{k|i} w_{Ii}$, where $d_{k|i}$ is the 
inverse of the probability of selecting element k within 
cluster i. (For example, in single stage cluster sampling, 
when all elements k in a sampled cluster are selected, then 
$d_{k|i} = 1$. Consequently $w_k = w_{Ii}$ is imposed, and all 
elements in the cluster receive the same weight for 
computing element statistics, and that same weight is also 
used for computing cluster statistics.) The calibration 
equation $\sum_s w_k x_k = \sum_U x_k$ then reads 
$\sum_{s_I} w_{Ii} \sum_{s_i} d_{k|i} x_k = \sum_U x_k$. The cluster weights $w_{Ii}$ are now 
derived by minimizing $\sum_{s_I} (w_{Ii} - d_{Ii})^2 / d_{Ii}$ subject to the 
calibration equation that takes both kinds of information into 
account: 

$$\sum_{s_I} w_{Ii} \begin{pmatrix} x_{(c)i} \\ \sum_{s_i} d_{k|i} x_k \end{pmatrix} = \begin{pmatrix} \sum_{U_I} x_{(c)i} \\ \sum_U x_k \end{pmatrix} \qquad (8.1)$$

Once the $w_{Ii}$ are determined, the element weights 
$w_k = d_{k|i} w_{Ii}$ follow. 
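
The integrated weighting just described can be sketched as follows, assuming linear (chi-square distance) calibration of the cluster weights on the combined cluster-level vector. The cluster-level variable is taken to be the constant 1 (so that the number of clusters is the known cluster total) and the element vector is (1, indicator); these choices, the helper name and the data are hypothetical.

```python
import numpy as np

def linear_calibrate(d, x, totals):
    """Chi-square (linear) calibration: w = d (1 + x' lam)."""
    T = x.T @ (d[:, None] * x)
    lam = np.linalg.solve(T, totals - d @ x)
    return d * (1.0 + x @ lam)

# Hypothetical sample of four clusters. Each holds its element vectors
# x_k = (1, female_k) and the within-cluster design weights d_{k|i}.
clusters = [
    {"x": np.array([[1, 1], [1, 0]], dtype=float), "d_within": np.array([1.0, 1.0])},
    {"x": np.array([[1, 1], [1, 1], [1, 0]], dtype=float), "d_within": np.array([1.0, 1.0, 1.0])},
    {"x": np.array([[1, 1]], dtype=float), "d_within": np.array([1.0])},
    {"x": np.array([[1, 1], [1, 0]], dtype=float), "d_within": np.array([2.0, 2.0])},
]
x_cluster = np.ones((4, 1))                  # cluster-level vector x_(c)i = 1
d_I = np.full(4, 12.5)                       # cluster design weights
cluster_total = np.array([50.0])             # known number of clusters
element_total = np.array([130.0, 70.0])      # known element totals (N, N_female)

# Combined cluster-level vector: (x_(c)i, sum over s_i of d_{k|i} x_k).
combined = np.array([
    np.concatenate([x_cluster[i], c["d_within"] @ c["x"]])
    for i, c in enumerate(clusters)
])
w_I = linear_calibrate(d_I, combined, np.concatenate([cluster_total, element_total]))

# Element weights follow from the cluster weights: w_k = d_{k|i} * w_Ii.
w_elements = [c["d_within"] * w_I[i] for i, c in enumerate(clusters)]
element_check = sum(w @ c["x"] for w, c in zip(w_elements, clusters))
cluster_check = w_I @ x_cluster
print(cluster_check, element_check)
```

One set of cluster weights thus reproduces both the cluster-level total and the element-level totals.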

Another reasonable integrated weighting is to impose 
$\sum_{s_i} w_k = N_i w_{Ii}$. For example, for single stage cluster 
sampling it implies that the cluster weight $w_{Ii}$ is the average 
of the element weights $w_k$ in that cluster. 

Two-stage sampling is also the topic in Kim, Breidt and 
Opsomer (2005). They assume auxiliary information for 
clusters, via a single quantitative cluster variable $x_{(c)i}$, but 
none for elements. They develop and examine a GREG type 
estimator of the element total $Y = \sum_U y_k$, 
$\hat{Y} = \sum_{i \in U_I} \hat{\mu}_i + \sum_{i \in s_I} d_{Ii} (\hat{t}_i - \hat{\mu}_i)$, where $\hat{t}_i$ is design unbiased 
for the cluster total $t_i = \sum_{U_i} y_k$, and $\hat{\mu}_i$ is obtained by local 
polynomial regression fit. The estimator can be expressed 
in linearly weighted form, with weights that turn out to 
be calibrated to the population totals of powers of the cluster 
variable $x_{(c)i}$. 


8.3 Household weighting and person weighting 


Some important social surveys set the objective to 
produce both household estimates and person estimates; 
some study variables are household (cluster) variables, 
others are person (element) variables. Consequently, a 
number of papers have addressed the situation with single 
stage cluster sampling ($d_{k|i} = 1$) and the integrated 
weighting that gives all members of a selected household 
equal weight, a weight also used for producing household 
statistics. A general solution for this weighting problem, 
when both household information and person information 
are specified, is to obtain the household weights $w_{Ii}$, 
calibrated as in equation (8.1) with $d_{k|i} = 1$, then take 
$w_k = w_{Ii}$. 

Several articles focus on auxiliary vector values $x_k$ 
attributed to persons. Alexander (1987) derives weights by 
minimizing chi-square distance, whereas Lemaitre and 
Dufour (1987) and Nieuwenbroek (1993) derive the 
integrated weights via a GREG estimator. The Lemaitre and 
Dufour technique proceeds by an indirect construction of an 
“equal shares auxiliary vector value” for all persons in a 
selected household; their result is derivable from the direct 
procedure in Section 8.2. 

The household-weighting/person-weighting question is 
revisited in more recent papers. Some authors display 
calibration thinking, others GREG thinking. Isaki, Tsay and 
Fuller (2004) formulate the problem as one of calibrated 
weighting; their weights respect both household controls 
and person controls; no explicit assisting models are 
formulated. By contrast, Steel and Clark (2007) proceed by 
the GREG approach, with linear assisting model statements 
and accompanying variance structures. 






9. Calibration for nonresponse adjustment 


9.1 Traditional adjustment for nonresponse 


The context of many good theory articles is the simple 
one of Section 2, which includes total absence of 
nonresponse. It is good theory for conditions that seldom or 
never occur. (As an author of papers in that stream, I am not 
without guilt.) Practically all surveys encounter nonresponse; 
although undesirable, it is a natural feature, and 
theory should incorporate it, from the outset, via a 
perspective of selection in two phases. 

In many surveys, nonresponse rates are extremely high 
today. Forty years ago they were so low that one could 
essentially ignore the problem. Today, 
survey sampling theory needs more and more to address the 
damaging consequences of nonresponse. In particular, one 
pressing objective is to examine the bias and to try to reduce 
it as far as possible. 

A probability sample s is drawn from 
$U = \{1, 2, \ldots, k, \ldots, N\}$; the known design weight of element k 
is $d_k = 1/\pi_k$. Nonresponse occurs, leaving a response set r, 
a subset of s; the study variable value $y_k$ is observed for 
$k \in r$ only. The unknown response probability of element k 
is $\Pr(k \in r \mid s) = \theta_k$. The unbiased estimator $\hat{Y} = \sum_r d_k \phi_k y_k$ 
is ruled out because $\phi_k = 1/\theta_k$ is unknown. To keep the idea 
of a linearly weighted sum, how do we then construct the 
weights? Unit nonresponse adjustment by weighting, based 
on “nonresponse modeling”, has a long history. Calibration 
offers a newer perspective. 

In what we may call “the traditional procedure”, the 
probability design weights $d_k = 1/\pi_k$ are first adjusted for 
nonresponse and possibly for other imperfections such as 
outliers. The information used for this step is often a 
grouping of the sampled elements. Finally, if reliable 
population totals are accessible, the adjusted design weights 
are subjected to a calibration with respect to those totals. 
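
A minimal sketch of this two-step sequence follows. It assumes that the nonresponse adjustment is a straight expansion within response classes and that the final calibration is a post-stratification to known group counts; both are common choices but are assumptions here, as are all names and data values.

```python
from collections import defaultdict

def class_sums(weights, labels, keep):
    """Sum the weights within each label over the units flagged in keep."""
    sums = defaultdict(float)
    for w, lab, flag in zip(weights, labels, keep):
        if flag:
            sums[lab] += w
    return sums

d = [10.0] * 8                                   # hypothetical design weights
adj_class = list("aaaabbbb")                     # nonresponse adjustment classes
responded = [True, True, False, True, True, False, False, True]
post = ["m", "f", "m", "f", "m", "f", "f", "m"]  # calibration groups
known_totals = {"m": 45.0, "f": 35.0}            # hypothetical population counts

# Step 1: expand respondent weights within each adjustment class.
sample_sum = class_sums(d, adj_class, [True] * len(d))
resp_sum = class_sums(d, adj_class, responded)
sub_weight = [dk * sample_sum[c] / resp_sum[c] if r else 0.0
              for dk, c, r in zip(d, adj_class, responded)]

# Step 2: calibrate the adjusted weights to the known group totals.
post_sum = class_sums(sub_weight, post, responded)
final_weight = [sw * known_totals[p] / post_sum[p] if r else 0.0
                for sw, p, r in zip(sub_weight, post, responded)]

final_sum = class_sums(final_weight, post, responded)
print(dict(final_sum))
```

After the second step the respondent weights add up to the known totals in each calibration group, whatever the quality of the first-step adjustment.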

The methodology of the Labour Force Survey of Canada, 
described in Statistics Canada (1998), exemplifies this 
widespread practice. A (modified) design weight is first 
computed for a given household, as the product of three 
factors. The product of the design weight and a nonresponse 
adjustment factor is called the sub-weight. The sub-weights 
are subjected in the final step to a calibration with respect to 
postcensal, highly accurate estimates of population by age 
group, sex and sub-provincial regions. The final weights 
meet the desirable objective of consistency, in regions 
within a province, with the postcensal estimates. The 
nonresponse bias remaining in the resulting estimates is 
unknown but believed to be modest. 

The traditional procedure is embodied in the estimator 
type $\hat{Y} = \sum_r d_k (1/\hat{\theta}_k) y_k$, where $\theta_k$ has been estimated by 
$\hat{\theta}_k$ in a preliminary step, using response (propensity) 
modeling. What theory demands of the statistician is not an 
easy task, namely, to formulate “the true response model”, 
capable of providing accurate, non-biasing values $\hat{\theta}_k$. But 
the factors $1/\hat{\theta}_k$ are applied in many surveys in an uncritical 
and mechanical fashion, for example, by straight expansion 
within the strata already used for sample selection. 

The traditional procedure is apparent for example in 
Ekholm and Laaksonen (1991) and in Rizzo, Kalton and 
Brick (1996). 

Practitioners often act as if the resulting 
$\hat{Y} = \sum_r d_k (1/\hat{\theta}_k) y_k$ (following a more or less probing response 
modeling trying to get the $\hat{\theta}_k$) is essentially unbiased, 
something which it is not (unless the ideal model happens to 
be specified); one acts (for purposes of variance estimation, 
for example) as if $\pi_k \hat{\theta}_k$ is the true selection probability of 
element k in a single step of selection, something which it is 
definitely not. This practice, with roots in the idyllic past, 
becomes more and more vulnerable as nonresponse rates 
continue their surreptitious climb. 

An unavoidable bias results from the replacement of $\theta_k$ 
by $\hat{\theta}_k$. Decades ago, when the typical nonresponse was but 
a few per cent, it was defensible to ignore this bias, but with 
today’s galloping nonresponse rates, the practice becomes 
untenable. By first principles, unbiased estimation is the 
goal, not an estimation where the squared bias is a 
dominating (and unknown) contributor to the Mean Squared 
Error. We must resolve to limit the bias as much as possible. 
Calibration reasoning can help in constructing an auxiliary 
vector that meets this objective. 


9.2 Calibration for nonresponse bias adjustment 


More or less contrasting with the traditional procedure 
are a number of recent papers that emphasize calibration 
reasoning to achieve the nonresponse adjustment. Recent 
references are Deville (1998, 2002), Ardilly (2006), chapter 
3, Skinner (1998), Folsom and Singh (2000), Fuller (2002), 
Lundström and Särndal (1999), Särndal and Lundström 
(2005) and Kott (2006). 

Calibration reasoning starts by assessing the total 
available auxiliary information: information at the sample 
level (auxiliary variable values observed for respondents 
and for nonrespondents), information at the population level 
(known population auxiliary totals). The objective is to 
make the best of the two sources combined, so as to reduce 
both bias and variance. The design weights are modified, in 
one or two calibration steps, to make them reflect (i) the 
outcome of the response phase, (ii) the individual 
characteristics of the respondents, and (iii) the specified 
auxiliary information. The information can be summarized 
as follows: 


Survey Methodology, December 2007 


Population level: The vector value $x_k^{*}$ is known (specified
in the frame) for every $k \in U$, thus known for every $k \in s$
and for every $k \in r$; $\sum_U x_k^{*}$ is a known population total.


Sample level: The vector value $x_k^{\circ}$ is known (observed) for
every $k \in s$, and thereby known for every $k \in r$; the
unknown total $\sum_U x_k^{\circ}$ is estimated without bias by
$\sum_s d_k x_k^{\circ}$.


Calibration on this composite information can be done in 
two steps (intermediate weights computed first, then used in 
the second step to produce final weights) or directly in one 
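single step. Only modest differences are expected in the bias
and variance of the estimates. In the single step option, the
combined auxiliary vector and the corresponding information are

$$x_k = \begin{pmatrix} x_k^{*} \\ x_k^{\circ} \end{pmatrix}; \qquad X = \begin{pmatrix} \sum_U x_k^{*} \\ \sum_s d_k x_k^{\circ} \end{pmatrix}$$

where $x_k^{*}$ is the population-level vector and $x_k^{\circ}$ the
sample-level vector.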


Using an extension of the instrument vector method in
Section 4.3, we seek calibrated weights $w_k = d_k v_k$, where
$v_k = F(\lambda' z_k)$ is the nonresponse adjustment factor, with a
vector $\lambda$ determined through the calibration equation
$\sum_r w_k x_k = X$; the resulting calibration estimator is
$\hat{Y}_{CAL} = \sum_r w_k y_k$. It is enough to specify the instrument
vector value $z_k$ for respondents only; $z_k$ is allowed to
differ from $x_k$. The function $F(\cdot)$ has the same role as in
Sections 4.2 and 4.3. Here, $F(\lambda' z_k)$ implicitly estimates the
inverse response probability, $\phi_k = 1/\theta_k$, as Deville (2002),
Dupont (1995), Kott (2006) have noted. In the linear case,
$F(u) = 1 + u$, and $v_k = 1 + \lambda' z_k$, with
$\lambda' = (X - \sum_r d_k x_k)' (\sum_r d_k z_k x_k')^{-1}$.
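
The linear case can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the function names `solve` and `linear_calibration` and the toy data are hypothetical, and the instrument vector defaults to $z_k = x_k$. It solves the calibration equation for $\lambda$ and returns $w_k = d_k(1 + \lambda' z_k)$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linear_calibration(d, x, X, z=None):
    """Weights w_k = d_k (1 + lambda' z_k) such that sum_r w_k x_k = X."""
    z = x if z is None else z  # assumption: instrument vector equals x_k by default
    p = len(X)
    # T[i][j] = sum_r d_k x_ki z_kj, so that T lambda = X - sum_r d_k x_k
    T = [[sum(dk * xk[i] * zk[j] for dk, xk, zk in zip(d, x, z))
          for j in range(p)] for i in range(p)]
    rhs = [X[i] - sum(dk * xk[i] for dk, xk in zip(d, x)) for i in range(p)]
    lam = solve(T, rhs)
    return [dk * (1.0 + sum(l * zj for l, zj in zip(lam, zk)))
            for dk, zk in zip(d, z)]


# Hypothetical respondent set: design weights, x_k = (1, size measure), known totals X.
d = [10.0, 10.0, 20.0, 20.0, 15.0]
x = [(1.0, 2.0), (1.0, 3.5), (1.0, 1.0), (1.0, 4.0), (1.0, 2.5)]
X = (120.0, 330.0)
w = linear_calibration(d, x, X)
```

The calibrated weights reproduce both components of $X$ when summed over the respondents, which is the defining property of the calibration equation.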

The variables that make up the sample-level vector $x_k^{\circ}$,
although observed for sampled elements only, can be crucially
important for the reduction of nonresponse bias (although
less important than the population-level $x_k^{*}$ for the reduction
of variance). For example, Beaumont (2005b) discusses how data
collection process variables can be used in building the
$x_k^{\circ}$ vector component.


9.3 Building the auxiliary vector 


In some surveys, there are many potential auxiliary
variables, as pointed out for example by Rizzo, Kalton and
Brick (1996), and Särndal and Lundström (2005). For
example, for surveys on households and individuals in
Scandinavia, a supply of potential auxiliary variables can be
derived from a matching of existing high quality
administrative registers. A decision then has to be made
which of these variables should be selected for inclusion in
the auxiliary vector $x_k$ to make it as effective as possible,
for bias reduction in particular. As Rizzo, Kalton and Brick
(1996) point out, “the choice of auxiliary variables is ...
probably more important than the choice of the weighting
methodology.”

Let us examine the bias, when $z_k = x_k$. We need to
compare alternative $x_k$-vectors in order to finally settle for one
likely to yield the smallest bias. (I assume $x_k$ to be such
that $\mu' x_k = 1$ for all $k$ and some constant vector $\mu$, as is the
case for many $x_k$-vectors, including the examples 1 to 5 at
the beginning of Section 2.) A close approximation to the
bias of $\hat{Y}_{CAL}$ is obtained by Taylor linearization as
nearbias$(\hat{Y}_{CAL}) = (\sum_U x_k)'(B_{U;\theta} - B_U)$, which involves a
difference between the weighted regression coefficient
$B_{U;\theta} = (\sum_U \theta_k x_k x_k')^{-1} \sum_U \theta_k x_k y_k$ and the unweighted one,
$B_U = (\sum_U x_k x_k')^{-1} \sum_U x_k y_k$. Unless all $\theta_k$ are equal, the
bias caused by the difference in the two regression vectors
may be substantial, even though $x_k$ is a seemingly “good
auxiliary vector”. This expression for nearbias is given in
Särndal and Lundström (2005); related bias expressions,
under different conditions, are found in Bethlehem (1988)
and Fuller et al. (1994). We can write alternatively
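nearbias$(\hat{Y}_{CAL}) = \sum_U (\theta_k M_k - 1) y_k$, where
$M_k = (\sum_U x_k)' (\sum_U \theta_k x_k x_k')^{-1} x_k$. In comparing possible
alternatives $x_k$, a convenient benchmark is the “primitive
auxiliary vector”, $x_k = 1$ for all $k$, which gives
$\hat{Y}_{CAL} = N \bar{y}_{r;d}$, where $\bar{y}_{r;d} = \sum_r d_k y_k / \sum_r d_k$ is the
design-weighted respondent mean, with
nearbias$(N \bar{y}_{r;d}) = N(\bar{y}_{U;\theta} - \bar{y}_U)$, where
$\bar{y}_{U;\theta} = \sum_U \theta_k y_k / \sum_U \theta_k$ and $\bar{y}_U = \sum_U y_k / N$. The ratio

$$\mathrm{relbias}(\hat{Y}_{CAL}) = \frac{\mathrm{nearbias}(\hat{Y}_{CAL})}{\mathrm{nearbias}(N \bar{y}_{r;d})} = \frac{\sum_U (\theta_k M_k - 1) y_k}{N(\bar{y}_{U;\theta} - \bar{y}_U)}$$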

measures how well a candidate vector $x_k$ succeeds in
controlling the bias, when compared with the primitive
vector. We seek an $x_k$ that will give a small bias. But
relbias$(\hat{Y}_{CAL})$ is not a computable bias indicator; it depends
on unobserved $y_k$ and on unobservable $\theta_k$. We need a
computable indicator that approximates relbias$(\hat{Y}_{CAL})$ and
depends on the x-vector but not on the y-variables, of
which the survey may have many.
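
The step from the regression-coefficient form of nearbias to the element-level form can be checked directly. The following is a short worked derivation, assuming only the stated condition $\mu' x_k = 1$ for all $k$ and the definitions of $B_U$, $B_{U;\theta}$ and $M_k$ given in the text:

```latex
% Assumes \mu' x_k = 1 for all k, so that \sum_U x_k = (\sum_U x_k x_k') \mu.
\begin{align*}
\Big(\sum_U x_k\Big)' B_U
  &= \mu' \Big(\sum_U x_k x_k'\Big)\Big(\sum_U x_k x_k'\Big)^{-1} \sum_U x_k y_k
   = \sum_U (\mu' x_k)\, y_k = \sum_U y_k, \\
\Big(\sum_U x_k\Big)' B_{U;\theta}
  &= \sum_U \theta_k \Big[\Big(\sum_U x_k\Big)'
     \Big(\sum_U \theta_k x_k x_k'\Big)^{-1} x_k\Big] y_k
   = \sum_U \theta_k M_k\, y_k, \\
\mathrm{nearbias}(\hat{Y}_{CAL})
  &= \Big(\sum_U x_k\Big)' (B_{U;\theta} - B_U)
   = \sum_U (\theta_k M_k - 1)\, y_k .
\end{align*}
```

For the primitive vector $x_k = 1$, $M_k = N / \sum_U \theta_k$ for every $k$, and the last line reduces to $N(\bar{y}_{U;\theta} - \bar{y}_U)$.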

It is easy to see that relbias$(\hat{Y}_{CAL}) = 0$ if an ideal
(probably non-existent) x-vector could be constructed such
that $\phi_k = 1/\theta_k = \lambda' x_k$ for all $k \in U$ and some constant
vector $\lambda$.

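For an x-vector that can actually be formed in the survey,
we can at least obtain predictions of the $\phi_k$: Determine $\lambda$
to minimize $\sum_U \theta_k (\phi_k - \lambda' x_k)^2$; we find $\lambda = \lambda_U$, where
$\lambda_U' = (\sum_U x_k)' (\sum_U \theta_k x_k x_k')^{-1}$; the predicted value of $\phi_k$ is
$\hat{\phi}_{kU} = \lambda_U' x_k = M_k$. The (theta-weighted) first and second
moments of the predictions $\hat{\phi}_{kU} = M_k$ are, respectively,
$\bar{M}_{U;\theta} = \sum_U \theta_k M_k / \sum_U \theta_k = 1/\bar{\theta}_U$ and

$$Q = \frac{1}{\sum_U \theta_k} \sum_U \theta_k (M_k - \bar{M}_{U;\theta})^2 = (1/\bar{\theta}_U)(\bar{M}_U - 1/\bar{\theta}_U)$$

where $\bar{\theta}_U = \sum_U \theta_k / N$ and $\bar{M}_U = \sum_U M_k / N$. Särndal and
Lundström (2007) show that relbias$(\hat{Y}_{CAL})$ and $Q$ have under
certain conditions an approximately linear relationship,

$$\mathrm{relbias}(\hat{Y}_{CAL}) \approx 1 - \frac{Q}{Q_0}$$

where $\bar{\phi}_U = \sum_U \phi_k / N$ and $Q_0 = (1/\bar{\theta}_U)(\bar{\phi}_U - 1/\bar{\theta}_U)$ is the
maximum value of $Q$. Thus if $Q$ were computable, it could
serve as an indicator for comparing the different candidate
$x_k$-vectors. A computable analogue $\hat{Q}$ of $Q$ is instead
obtained as the variance of the corresponding sample-based
predictions $m_k = (\sum_s d_k x_k)' (\sum_r d_k x_k x_k')^{-1} x_k$, $k \in s$, so that

$$\hat{Q} = \frac{1}{\sum_r d_k} \sum_r d_k (m_k - \bar{m}_{r;d})^2 = \bar{m}_{r;d} (\bar{m}_{s;d} - \bar{m}_{r;d})$$

where

$$\bar{m}_{r;d} = \frac{\sum_r d_k m_k}{\sum_r d_k}; \qquad \bar{m}_{s;d} = \frac{\sum_s d_k m_k}{\sum_s d_k}.$$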


We expect relbias to decrease in a roughly linear fashion as
$\hat{Q}$ increases; thus, independently of the y-variables, $\hat{Q}$ may
be used as a tool for ranking different x-vectors in regard to
their capacity to reduce the bias.

We can use $\hat{Q}$ as a tool to select x-variables for inclusion
in the $x_k$-vector, for example, by stepwise forward
selection, so that variables are added to $x_k$ one at a time, the
variable to enter in a given step being the one that gives the
largest increment in $\hat{Q}$. The method is described in Särndal
and Lundström (2007).
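
The indicator and the forward selection can be sketched in a few lines for a special case. The sketch below is an illustration under stated assumptions, not the procedure of Särndal and Lundström (2007): it assumes purely categorical auxiliary variables that are fully crossed, so that $x_k$ is the indicator vector of the cell containing $k$ and $m_k$ reduces to the ratio of the design-weighted sample count to the design-weighted respondent count in that cell. The names `q_hat` and `forward_select` and the toy data are hypothetical.

```python
from collections import defaultdict


def q_hat(d, cells, resp):
    """Q-hat when x_k is the indicator vector of the cell that element k falls in."""
    tot_s, tot_r = defaultdict(float), defaultdict(float)
    for dk, c, rk in zip(d, cells, resp):
        tot_s[c] += dk          # design-weighted sample count in the cell
        if rk:
            tot_r[c] += dk      # design-weighted respondent count in the cell
    # m_k for every sampled element; raises ZeroDivisionError if a cell has no respondents
    m = [tot_s[c] / tot_r[c] for c in cells]
    d_r = sum(dk for dk, rk in zip(d, resp) if rk)
    m_bar_r = sum(dk * mk for dk, mk, rk in zip(d, m, resp) if rk) / d_r
    return sum(dk * (mk - m_bar_r) ** 2 for dk, mk, rk in zip(d, m, resp) if rk) / d_r


def forward_select(d, variables, resp):
    """Greedy forward selection: each step adds the variable giving the largest Q-hat."""
    chosen, remaining, path = [], list(variables), []
    n = len(d)
    while remaining:
        scores = {}
        for name in remaining:
            # Assumption: the candidate is crossed with the variables already chosen.
            cells = [tuple(variables[v][k] for v in chosen + [name]) for k in range(n)]
            scores[name] = q_hat(d, cells, resp)
        best = max(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        path.append((best, scores[best]))
    return path


# Hypothetical sample of 12 elements with equal design weights.
d = [10.0] * 12
variables = {
    "gender": list("FFFFFFMMMMMM"),
    "age": list("yyyoooyyyooo"),
}
resp = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
path = forward_select(d, variables, resp)
```

In this toy sample the response rate differs by gender but not by age alone, so gender enters first; age then adds to $\hat{Q}$ only through its interaction with gender.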


10. Calibration to account for other 
non-sampling error 


Nonresponse errors are critical determinants of the 
quality of published statistics. When we examine how the
calibration approach may intervene in the treatment of
sources of non-sampling error other than nonresponse, the
literature to date is, not surprisingly, much less extensive.
However, several authors sketch a calibration reasoning to 
also incorporate frame errors, measurement errors, and 
outliers. Calibration has a potential to provide a more 
general theory for estimation in surveys, encompassing the 
various non-sampling errors. 

As Deville (2004) points out (my translation from the 
French): “The concept of calibration lends itself to be 
applied with ease and efficiency to a great variety of 
problems in survey sampling. Its scope goes beyond that of 
regression estimation, an idea to which some seem to wish 
to reduce the calibration approach”. He provides a brief
sketch of how a treatment of several of the nonresponse
errors may be accomplished under the caption of calibration
thinking.

Folsom and Singh (2000) present a weight calibration 
method using what they call the generalized exponential 
model (GEM). It deals with three aspects: extreme value 
treatment, nonresponse adjustment and calibration through 
post-stratification. The method provides built-in control for 
extreme values. Calibration to treat both coverage errors 
(under- or over-coverage of the frame) and nonresponse is 
discussed in Särndal and Lundström (2005) and Kott
(2006). Skinner (1998) discusses uses of calibration in the 
presence of nonresponse and measurement error. He notes 
something which remains a challenge almost ten years later: 
“More research is needed to investigate the properties of 
calibration estimates in the presence of non-sampling 
errors”. 


11. Conclusion 


If I am to select one issue for a concluding reflection on 
the contents of this paper, let me focus on the concept of 
auxiliary information. It is the pivotal concept in the paper. 
If there is no auxiliary information, there is no calibration
approach; there is nothing to calibrate on. I noted on the 
other hand that regression (GREG) estimation is an 
alternative but different thought process for putting auxiliary 
information to work in the estimation. 

An objective in this paper has been to give a portrait of 
the two types of reasoning, and I made a point of noting 
how the thinking differs. I gave examples where essentially 
the same estimation objective is tackled by some authors 
through calibration reasoning, by others through GREG 
reasoning (or at least primarily by one or the other type). 
The respective estimators that they end up recommending 
may or may not agree. Whether or not the difference has 
significant consequence (for variance, for bias, for practical 
matters such as consistency and transparency) depends on 
the situation. This paper may help to create an awareness
of the separation existing between two thought processes
that have guided researchers in survey sampling.
Weighting for two-phase surveyed data


Seppo Laaksonen


Abstract 


Missingness may occur in various forms. In this paper, we consider unit non-response, and hence make attempts for 
adjustments by appropriate weighting. Our empirical case concerns two-phase sampling so that first, a large sample survey 
was conducted using a fairly general questionnaire. At the end of this contact the interviewer asked whether the respondent 
was willing to participate in the second phase survey with a more detailed questionnaire concentrating on some themes of 
the first survey. This procedure leads to three missingness mechanisms. Our problem is how to weight the second survey 
respondents as correctly as possible so that the results from this survey are consistent with those obtained with the first phase 
survey. The paper first analyses missingness differences in these three steps using a human survey dataset, and then 
compares different weighting approaches. Our recommendation is that all available auxiliary data should be used in
the best possible way. This works well with a mixture of the two classic methods that first exploits response propensity weighting
and then calibrates these weights to the known population distributions. 


Key Words: Calibration; Internal vs. external auxiliary variables; Response propensity modelling method; Selective sub-sample.


1. Introduction 


A standard survey is composed of one step or phase. This 
means that the potential survey units have first been chosen 
using a certain sampling design, and attempts have then 
been made to contact and interview these units as well as 
possible. However, varying amounts of non-response or 
other forms of missingness or data deficiencies will have 
occurred. Usually, addressing missingness leads to the 
application of post-survey adjustment methods of varying 
degrees of sophistication, which take advantage of available 
auxiliary variables. The auxiliary variables may be derived 
from various sources (see e.g., Laaksonen 1999 or an
extended version in Laaksonen 2006b, and Lundström and
Särndal 2001), but for weighting purposes these are usually
taken from registers, or other administrative sources or 
surveys. These kinds of auxiliary variables could be called 
external, if we want to distinguish them from internal 
auxiliary variables, that is, internal in the sense that the 
information is derived from the same survey or from its 
predecessor. 

Internal auxiliary variables are especially used for 
imputations when the values for some items are missing. 
Such variables are also extensively used in panel surveys if 
a certain respondent has responded in one wave but not in 
another. In panel surveys, internal auxiliary information 
may be used both for weighting adjustments and for 
imputations. 

This paper does not concern a standard survey as 
described above. It discusses two special characteristics: 


(i) A survey consisting of two (or, in some sense,
three) steps or phases. The first phase is like a
standard survey, in which a certain number of
units respond. For the second phase, we only
keep in the frame the respondents who are willing
to contribute to a more detailed survey. We
therefore have to distinguish between the
first-phase respondents who say they are willing to
participate voluntarily, on the one hand, and those
among them who actually answer the second
questionnaire, on the other. The latter subgroup
will thus have responded to both questionnaires.


(ii) When making attempts for post-survey adjust- 
ments, we will have the option to exploit both 
external and internal auxiliary variables in the 
second phase. The internal variables will thus be 
available from the first survey. 


We are only considering weighting adjustments although 
some of our ideas could be used in imputations, too. The 
approach of the paper has not been much used in cross- 
sectional surveys although the same problem has been often 
met. For instance, it is typical that a face-to-face survey is 
conducted first and that at the end of it the interviewer will 
request the interviewee to respond to a self-administered 
additional questionnaire and, if the respondent is willing, the 
interviewer will hand out the questionnaire immediately for 
filling in, or submit it later to the volunteer. In both cases, 
the answers will be received by post or email. A recent 
example of this type is the European Social Survey (ESS) in 
which the supplementary questionnaire concerns especially 
the values of life (see www.Europeansocialsurvey.com). 
Naturally, not all face-to-face respondents fill in this
questionnaire. 

The second-phase questions do not necessarily concern 
the same topic as those of the core questionnaire. Another 
usual strategy is to start with a broad questionnaire on a 
specific subject and then continue in the second phase with 
more detailed questions about the same subject. There can 
be some feedback from the first phase to the second 
questionnaire, and even to a sample that depends on the 
distribution of the key variables of the first survey (this is an 
example of adaptive sampling). This is often the case where 
there is not much experience in this type of a survey. Thus, 
the first survey also plays the role of a pilot survey. The so- 
called master samples are also close to this idea whereby the 
first phase survey (including a sample from administrative 
sources, like a micro census in some countries) may be 
conducted for constructing an appropriate sampling frame. 
In this case, the variables of the master sample are fairly 
limited, including usually only factual or background 
information. 

In the case of master sampling, the objective is that the 
constructed sampling frame is a good representation of the 
target population. Hence, when going to a sample, this 
frame information can be well used as auxiliary data for 
editing, imputing and weighting of the real survey (second 
phase survey). Each real survey is thus a sub-sample of the 
master sample. We consider here a more complex case as 
illustrated by Table 1. 


Table 1 Illustration of initial sample and three follow-up datasets

Dataset                           Variables                                              Weights          Units
Sample                            Auxiliary variables X (Gender, Age, Region, Season)    Design weights   12,554 (without overcoverage)
First-phase respondents           $Y_1$ (e.g., Health, Outdoor)                          Basic weights    10,666
Volunteers for the second phase   $Y_1$                                                  Weights          8,481
Second-phase respondents          $Y_2$ (e.g., Skating, Boating)                         Weights          5,480

First, there is a standard sampling procedure including 
some auxiliary variables X. A fairly high response rate was 
obtained (10,666 out of 12,554 units, about 85%) for the 
first survey. Some attrition occurred due to the fact that not
all respondents were willing to participate in the second
survey (we now have 68% of the initial sample left). Due to
a rather high non-response rate in this second phase (in spite 
of voluntariness), our remaining sub-sample covers only 44 


Statistics Canada, Catalogue No. 12-001-XPB 




per cent of the initial sample. We now have the following 
three datasets available for the analysis: 


A. First-phase respondents with survey variables $Y_1$

B. Second-phase respondents with survey variables $Y_2$

C. Both first and second-phase respondents with
survey variables $Y_1$ and $Y_2$.


Most users will receive both files, A and B, and they 
can merge these together and obtain file C if a common 
identifier is available. What does a user expect having 
received both data files? Naturally, that the estimate for the 
same parameter from both files is as identical as possible, 
that is, the results are consistent with each other. The user 
obviously understands that a certain parameter estimated 
from the smaller file C is less accurate than that estimated 
from a larger file. In principle, it is possible to impute the
missing values for variables $Y_2$, but we do not believe that it
is possible to do this well, hence we approach this question
by weighting. Our aim is to attempt to construct adjusted
sampling weights for file B so that the analysis over
variables $Y_1$ and $Y_2$ is as adequate as possible.

Several strategies can be used for this kind of weighting. 
Useful general aspects have been presented by, among 
others, Kalton and Kasprzyk (1986), Little (1986), Särndal,
Swensson and Wretman (1992), Fuller, Loughin and Baker
(1994), Wu and Sitter (2001), and Lundström and Särndal
(2001). If we assume that the missingness only depends on
the sampling design, we can construct the weights for files 
A and B in the respective way. For example, if stratified 
random sampling has been applied, the same stratification 
would naturally be applied to both phases. In the case of 
post-stratification, an analogous strategy may be applied. 

In our particular example survey, the sampling frame 
contained the respondents of the first rotation group of the 
12 months of the Finnish LFS. Each monthly sample was 
drawn randomly. The LFS is based on simple random 
sampling, but due to nonresponse these weights were 
adjusted by a standard calibration technique (Deville,
Särndal and Sautory 1993) using gender, age group (six
categories) and region (five categories) as auxiliary
variables. Later, we refer to these as design weights. The
basic weights for the first-phase respondents were
constructed correspondingly, adding the variable season
(4 × 2 = 8 categories over two years) to the pattern of auxiliary
variables. The ‘season’ variable is rarely used in Finnish 
surveys but was here considered to be necessary due to the 
‘seasonal’ nature of the survey (see Section 2). The present 
paper does not consider this aspect in detail. The first three 
variables are usual in Finnish human surveys because such 
information can be validated well from updates in the 
population register. This being the case, we now presume 
that we will have the best possible estimates for the first-
phase respondents when using these adjusted calibrated 
weights. In any case, we have no further access to other 
possible useful information from external sources. 

It is possible to use estimates obtained from the first- 
phase respondents as benchmarking information to calibrate 
the second-phase weights. This strategy is not difficult as 
such, but all variables X and as many of the variables $Y_1$
as possible should be included in this process. Moreover, as 
precise aggregate or domain levels as possible should be 
found for this strategy, which is not an easy job and hence 
not attempted in this study. Nevertheless, it is not 
guaranteed that the estimates for other aggregates will be 
unbiased enough (results in Laaksonen 1999 give some 
evidence for this conclusion). 

My proposed strategy is more straightforward and works 
without technical problems for all domains although it is not 
of course guaranteed that a possible bias will be 
substantially reduced for all domains. Thus, I have not tried 
any advanced calibration strategy, although this could be 
workable. I hope that other authors will show its possible 
benefits. A useful reference for them is the paper by Dupont 
(1995) that considers calibration of two-phase survey data, 
however without empirical evidence. It should be noted that 
I use calibration, but not a very advanced one (see Section
3).

The proposed methodology of this paper is largely based 
on a response propensity modelling that has been 
successfully used in other types of situations, see e.g., 
Ekholm and Laaksonen (1991), Laaksonen (1999), Duncan 
and Stasny (2001), and Laaksonen and Chambers (2006). 
The situation of Rizzo, Kalton and Brick (1996) is fairly 
close to the two-phase case of this paper, although it is 
concerned with a panel. Their methodology also has some 
similar features. In addition, a major difference concerns the 
response mechanism that here occurs in two steps, that is, 
both due to voluntariness and due to response in the second 
phase. We analyse these steps separately, too. Naturally, we 
compare the results obtained with alternative techniques. In 
Section 2, we briefly further describe our surveys and 
datasets, and Section 3 details the principles of our methods. 
Section 4 presents comparison results, and Section 5 draws 
a conclusion. 


2. Principles of the datasets 


The data are from a special survey conducted among 
Finnish citizens aged 15-74 years old (for more information, 
see Virtanen, Pouta, Sievänen and Laaksonen 2001). The
topic concerns their leisure time activities, especially 
relating to outdoor hobbies and activities. First, a CAPI 
(computer-assisted personal interview) survey was 
conducted, covering various leisure time and hobby
questions such as cycling, motorcycling, walking, jogging, 
sailing, swimming, hunting, fishing, nature photography, 
skiing, skating and riding. In all cases, the reference period 
was the previous year. Second, at the end of this survey the 
respondents were asked whether they would be willing to 
receive a special postal survey questionnaire in which more 
detailed questions would be asked about some of these 
activities. This survey would be conducted in a few weeks’ 
time. 

The survey was conducted over two years (1998-2000) in 
order to reduce response and interviewing burden. Another 
reason for this was that since these activities are seasonal to 
some extent, the responses were expected to be seasonally 
influenced (e.g., responses to questions about skiing might 
be different in summer and winter). The initial sample size 
after the removal of overcoverage (104 units of over- 
coverage) was 12,554 individuals. 

We chose the following binary variables for our analysis 
presented in Section 4: Outdoors (person has performed 
regularly some outdoor activities in the nature), Health (is 
health good enough for outdoor activities?), Skiing, Fishing, 
Skating, Boating, Cycling and Jogging. In all cases, value = 
1 means that a person has engaged in the activity during the 
preceding year, and value = 0, respectively, the opposite. All 
these variables were included in the first-phase question- 
naire and we hence knew what to expect after the two 
consecutive phases. Note, that there are more complex 
variables in the data set but this simpler choice was made in 
order to interpret results more easily. The main conclusions 
are the same in the case of another choice. 

In Section 4, we present two types of comparison, (i) 
those based on known information from the first phase, and 
(ii) those not based on known information. In both cases, we
can fortunately check how well we have succeeded in the 
reduction of bias since we actually know the ‘true’ (or best 
possible) estimates. In addition, we analyse some variables 
only included in the second questionnaire, but we cannot 
say definitely how well each method works in these cases. 
We do not present the latter results in detail but these were
observed to behave similarly to those of the second 
approach. 


3. Response propensity modelling 
method and calibration 


This study comprises three steps with the following 
weighting specifications: 

First, well-designed calibrated sampling weights for the 
first phase respondents were created using the variables 
Region, Gender, Age group and Season (see also Section 1). 
Let us use the symbol $w_k$ for these sampling weights for
respondent $k$. These weights are thus based on calibration,
and are also called ‘Basic’. Note that before this we had,
naturally, constructed design weights for the dataset,
based on the stratified random sampling design. These are
thus available for the non-respondents of the first phase, too.

Next, we model voluntariness/response probabilities
using the most common link function (logit is not
necessarily the best link function, as learned from Laaksonen
(2006a), but this is what we use here), that is,
logit $= \log(\pi/(1 - \pi))$, in which $\pi$ is the binary response
probability (either 1 = volunteer vs. 0 = non-volunteer or
1 = respondent vs. 0 = non-respondent) and the explanatory
variables consist of variables X and some variables $Y_1$ that
have been considered to be ‘good.’ The model gives the predicted
response probabilities $p_k$ that are now used in the following
way when constructing each particular adjusted sampling
weight:


$$w_k(\mathrm{res}) = \frac{w_k g_c}{p_k}$$

Here $g_c$ = a scaling factor which benchmarks the
weights to certain known aggregates at level $c$. There are
several alternatives for this benchmarking, but some type of
calibration could be considered as a standard way. In this
study we use post-stratified aggregates $h$ (this being the
cross-classified cell of all three X variables = Age group,
Gender and Region, the whole number of cells =
6*2*5 = 60) using the following straightforward technique

$$g_h = \frac{\sum_h w_k}{\sum_h w_k / p_k}$$


As already pointed out, the quality is high in Finland for 
these kinds of post-stratified aggregates but not necessarily 
for any other aggregates. 
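
The adjustment can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `propensity_weights` and the toy data are hypothetical, and it assumes that the numerator of $g_h$ sums the basic weights over all first-phase respondents in cell $h$ while the denominator sums $w_k/p_k$ over the responding units in the same cell.

```python
from collections import defaultdict


def propensity_weights(w, p, cell, responded):
    """Return w_k * g_h / p_k for respondents and None for non-respondents."""
    target = defaultdict(float)  # sum of basic weights over all units in cell h (assumed benchmark)
    raw = defaultdict(float)     # sum of w_k / p_k over respondents in cell h
    for wk, pk, h, rk in zip(w, p, cell, responded):
        target[h] += wk
        if rk:
            raw[h] += wk / pk
    g = {h: target[h] / raw[h] for h in raw}  # scaling factor per post-stratum
    return [wk * g[h] / pk if rk else None
            for wk, pk, h, rk in zip(w, p, cell, responded)]


# Hypothetical basic weights, predicted propensities, post-strata and response flags.
w = [100.0, 100.0, 120.0, 80.0, 80.0, 90.0]
p = [0.8, 0.5, 0.6, 0.7, 0.4, 0.5]
cell = ["a", "a", "a", "b", "b", "b"]
responded = [True, True, False, True, False, True]
adjusted = propensity_weights(w, p, cell, responded)
```

Within each cell the adjusted weights of the respondents sum to the cell total of the basic weights, while the relative sizes of the weights inside a cell still reflect the estimated propensities.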

Because we have two steps for the second phase, we 
have the following three model options that were all also 
used in Section 4: 


(a) Model for voluntariness 

(b) Model for the response given that the person 
volunteered (called also ‘TwoStep’). 

(c) Model for the response as one step (called also 
‘Direct’ and ‘OneStep’). 


Note that steps (a) and (b) together give the weights for
file B. This leads to the following formulation (vol =
volunteer; $p_1$ = estimated response probability at step 1,
$p_2$ = estimated response probability at step 2; respectively
for the scaling factors $g_1$ and $g_2$):


Step 1: $w_k(\mathrm{vol}) = w_k g_1 / p_{1k}$

Step 2: $w_k(\mathrm{res}) = w_k(\mathrm{vol}) g_2 / p_{2k}$
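
The two steps can be chained as in the following sketch. It is an illustration under assumptions, not the paper's code: the helper `adjust` and the toy data are hypothetical, the same post-strata are assumed for $g_1$ and $g_2$, and each scaling factor is assumed to restore the cell total of the weights entering that step.

```python
from collections import defaultdict


def adjust(w, p, cell, eligible, success):
    """One adjustment step: w_k * g_h / p_k for eligible units that succeed, else 0."""
    target, raw = defaultdict(float), defaultdict(float)
    for wk, pk, h, e, s in zip(w, p, cell, eligible, success):
        if e:
            target[h] += wk          # weight total entering this step in cell h
            if s:
                raw[h] += wk / pk    # propensity-inflated total of the successes
    return [wk * target[h] / raw[h] / pk if (e and s) else 0.0
            for wk, pk, h, e, s in zip(w, p, cell, eligible, success)]


# Hypothetical first-phase respondents with basic weights w and two sets of propensities.
w = [100.0, 110.0, 90.0, 100.0, 120.0, 80.0, 100.0, 100.0]
cell = ["a"] * 4 + ["b"] * 4
p1 = [0.9, 0.8, 0.7, 0.5, 0.9, 0.6, 0.4, 0.7]   # estimated probability of volunteering
p2 = [0.8, 0.4, 0.7, 0.5, 0.9, 0.5, 0.5, 0.3]   # estimated probability of responding, given volunteering
volunteered = [True, True, True, False, True, True, False, True]
responded = [True, False, True, False, True, True, False, False]

first_phase = [True] * len(w)
w_vol = adjust(w, p1, cell, first_phase, volunteered)   # Step 1
w_res = adjust(w_vol, p2, cell, volunteered, responded)  # Step 2
```

Because each step rescales within cells, the final weights of the second-phase respondents add up to the basic-weight total of every cell, so estimates from file B stay aligned with the first-phase benchmarks at that level.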


The correct sampling weights had to be used in each 
modelling task. For models (a) and (c) this meant weights
$w_k$, but for model (b) weights $w_k(\mathrm{vol})$. In our comparison
tests we also modelled the first-phase response and here we 
used design weights. The use of weights in the modelling 
gives more correct estimates, since we are trying to make 
our analysis representative for the target population. In some 
cases, the influence of weights is substantial, like in business 
surveys where weights often vary more than in standard 
household surveys. In this case, the results between 
weighted and unweighted models were not highly different, 
although the weighted ones should be used (this is well 
justified in Laaksonen and Chambers 2006 in which the 
influence of weights is substantial; Rizzo et al. 1996 also
use weights). The empirical results (estimates, their standard 
errors and response probabilities) for the weighted solutions 
are presented later in Section 4. 

In addition to our above key techniques, in the next
Section we also use weights $w_k$ when providing our ‘best
possible’ estimates for such parameters that are known, thus
based on variables $Y_1$.

Moreover, we compare our specific results using post- 
stratified calibration only without modelling (we use also 
symbol ‘cal’ in the remaining sections). The latter could be 
interpreted as a very standard way of approaching the 
weighting problem (this was a house style prior to the
methodology proposed here). Note, however, that if a
response model only includes the variables (and the same
categories) used in post-stratification, the response
propensity-based weights are exactly the same as obtained 
by post-stratification. 
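
The equivalence can be seen in one line. The following short derivation is a sketch that assumes the predicted probability is constant within each post-stratum, $p_k = p_h$ for all $k$ in cell $h$, and that the numerator of $g_h$ runs over all units of the cell while the denominator runs over its respondents $r_h$ (this index notation is introduced here for illustration only):

```latex
% Assumes p_k = p_h for every k in post-stratum h.
\begin{align*}
g_h &= \frac{\sum_{k \in h} w_k}{\sum_{k \in r_h} w_k / p_h}
     = p_h \, \frac{\sum_{k \in h} w_k}{\sum_{k \in r_h} w_k}, \\
w_k(\mathrm{res}) &= \frac{w_k \, g_h}{p_h}
     = w_k \, \frac{\sum_{k \in h} w_k}{\sum_{k \in r_h} w_k}, \qquad k \in r_h .
\end{align*}
```

The propensity $p_h$ cancels, leaving the usual post-stratification weight, whatever value the model assigns to the cell.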


4. Empirical results 


This Section presents results from different methods. 
First, we give results from different response models and 
then go on to compare different weights with each other, 
and at the end of the Section compare some parameter 
estimates based on different techniques. 


4.1 Models for voluntariness and response 


In order to fully understand the behaviour of missingness 
(due to non-response and voluntariness) in all three phases 
of the survey, we present in Table 2 results that are based on 
such auxiliary variables X that are available in each step 
(in practice, we also used the variable ‘season’ but do not 
include its effects in this analysis since it is not a key issue 
in this paper). 
Table 2 Logistic regressions using the three common explanatory variables in the three phases, that is, for the first phase
respondents, for the voluntary respondents in the second phase and for the real respondents in the second phase. The
estimates are odds ratios; their 95% confidence intervals are presented in parentheses


Explanatory variables     Model 1                Model 2a        Model 3a                  Model 4a
and other statistics      First phase response   Voluntariness   Response for volunteers   Second phase response

Gender (ref. Female)
  Male                    0.71                   0.84            0.75                      0.77
                          (0.65, 0.78)           (0.76, 0.93)    (0.68, 0.83)              (0.71, 0.83)
Age group (ref. 65+)
  24 and under            1.00                   5.57            0.51                      1.49
                          (0.83, 1.21)           (4.65, 6.68)    (0.41, 0.64)              (1.28, 1.73)
  25-34                   0.96                   4.76            0.65                      1.73
                          (0.79, 1.15)           (4.00, 4.81)    (0.52, 0.81)              (1.49, 2.00)
  35-44                   0.85                   4.08            0.64                      1.62
                          (0.71, 1.02)           (3.46, 4.81)    (0.52, 0.80)              (1.40, 1.88)
  45-54                   0.89                   3.16            0.86                      1.82
                          (0.74, 1.07)           (2.71, 3.69)    (0.69, 1.06)              (1.58, 2.10)
  55-64                   1.18                   2.05            1.15                      1.75
                          (0.96, 1.45)           (1.74, 2.41)    (0.90, 1.47)              (1.49, 2.04)
Region (ref. North)
  South-East              0.55                   2.12            0.96                      1.35
                          (0.46, 0.66)           (1.79, 2.50)    (0.79, 1.16)              (1.18, 1.55)
  South-West              0.76                   1.83            1.04                      1.35
                          (0.64, 0.91)           (1.57, 2.14)    (0.86, 1.25)              (1.18, 1.55)
  Mid-West                1.14                   2.14            1.16                      1.56
                          (0.91, 1.42)           (1.77, 2.59)    (0.93, 1.43)              (1.33, 1.83)
  Mid-East                0.96                   1.20            1.15                      1.19
                          (0.78, 1.18)           (1.01, 1.44)    (0.92, 1.43)              (1.02, 1.40)

Number of observations    12,554                 10,666          8,481                     10,666
-2 Log L                  10,904                 10,296          8,569                     14,618


There are many interesting outcomes in these
consecutive missingness behaviour models. The results of
the first survey are fairly ordinary, for example, men 
respond more poorly than women in both phases. The 
response propensities are also lower in the South than in the 
rest of the country. The differences between age groups are 
somewhat surprising since the middle-age groups respond 
most poorly. 

The voluntariness estimates are different. People in the
Mid-East and North are the least willing to participate in the
second survey, but the response propensities, given that a
person has volunteered, do not differ much between regions.
By age, it seems that younger people are more willing to
participate but do not, nevertheless, respond very well. Older
people thus seem, in this sense, to be more prepared to make
a commitment than young people. However, we see clearly that
the oldest ones will be under-represented without adjustments.
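
As a reading aid for Tables 2 and 3, the following minimal Python sketch shows how an odds ratio and its 95% confidence interval are commonly derived from a logistic regression coefficient. It assumes a Wald-type interval on the log-odds scale; the paper does not state how its intervals were computed, and the function name, coefficient and standard error below are hypothetical and chosen only for illustration.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound).

    Assumes a Wald interval beta +/- z * se on the log-odds scale,
    which is then exponentiated. This is an illustrative assumption,
    not necessarily the procedure used for Tables 2 and 3.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error (not taken from the paper).
estimate, lower, upper = odds_ratio_ci(beta=-0.342, se=0.047)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

Under this assumption the interval is symmetric on the log scale, so the odds ratio equals the geometric mean of its two bounds. This gives a quick consistency check when reading a printed table.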

When considering the first two internal auxiliary
variables (Table 3), it is observed that people who are
not relatively healthy (variable Health) and who do not
actively pursue recreation in nature (Outdoor) are not
willing to receive any new questionnaire, either. This is seen
from the very high odds ratios. Interestingly, the respective
odds ratio for the variable Health is close to one for the
volunteers. This means that a non-healthy person is
not very likely to volunteer, but if he/she does, he/she
responds as well as a healthy one. The tendency is similar
with the variable Outdoor. It should be noted that the
non-healthy and non-outdoor domains are not very large and,
although their roles in the response propensity modelling are
important, their impacts on the final estimates are not very
dramatic (Section 4.3).

When adding the other two internal auxiliary variables,
that is, Skiing and Fishing, the same selectiveness continues
although it is not as substantial. In conclusion, we see clearly
that the response mechanism of the second survey is far from
non-informative. Consequently, it is expected that this will
have some effect on the reweighting and on the survey
estimates. These are considered in the next two subsections.


Statistics Canada, Catalogue No. 12-001-XPB 




Table 3 
Logistic regressions using some auxiliary variables from the first phase respondents in addition to those used in Table 2. The model
numbers in this table and in Table 2 correspond to each other so that the response variable and the datasets are the same


Explanatory variables    Model 2b             Model 3b         Model 4b         Model 4c
and other statistics     Voluntariness        Response for     Second phase     Second phase
                                              volunteers       response         response
Gender (ref. Female)
  Male                   0.94                 0.77             0.82             0.75
                         (0.85, 1.04)         (0.69, 0.85)     (0.75, 0.88)     (0.68, 0.83)
Age group (ref. 65+)
  24 and under           4.92                 0.52             1.30             0.51
                         (4.07, 5.97)         (0.41, 0.65)     (1.12, 1.52)     (0.41, 0.64)
  25-34                  3.83                 0.65             1.46             0.65
                         (3.18, 4.60)         (0.52, 0.81)     (1.25, 1.70)     (0.52, 0.81)
  35-44                  3.26                 0.64             [illegible]      0.64
                         (2.74, 3.88)         (0.51, 0.80)     (1.18, 1.58)     (0.52, 0.80)
  45-54                  [illegible]          0.85             1.56             0.86
                         (2.20, 3.05)         (0.68, 1.06)     (1.34, 1.81)     (0.69, 1.06)
  55-64                  [illegible]          1.18             1.55             [illegible]
                         (1.45, 2.05)         (0.89, 1.46)     ([illegible])    (0.90, 1.47)
Region (ref. North)
  South-East             2.15                 0.96             1.34             0.96
                         ([illegible], 2.55)  (0.79, 1.16)     (1.16, 1.54)     (0.79, 1.16)
  South-West             1.92                 1.04             1.36             1.04
                         (1.64, 2.26)         (0.86, 1.25)     ([illegible])    (0.86, 1.25)
  Mid-West               2.09                 [illegible]      1.52             1.16
                         (1.71, 2.54)         (0.93, 1.43)     (1.29, 1.78)     (0.93, 1.43)
  Mid-East               [illegible]          [illegible]      1.18             [illegible]
                         (0.98, 1.41)         (0.91, 1.42)     (1.00, 1.38)     (0.92, 1.43)
Outdoor                  3.04                 1.24             1.93             [illegible]
                         (3.43, 2.71)         (1.43, 1.07)     (2.15, 1.74)     (1.97, 1.59)
Health                   3.61                 1.02             2.44             2.49
                         (4.61, 2.82)         (1.58, 0.66)     (3.51, 2.09)     (3.24, 1.92)
Skiing                                                                          1.36
                                                                                (1.47, 1.25)
Fishing                                                                         1.27
                                                                                (1.38, 1.17)
Number of observations   10,666               8,481            10,666           10,666
-2 Log L                 [illegible]          8,560            14,342           14,244


4.2 Comparison of different weights 


As already explained, we provided several weights. 
Table 4 gives a summary of these with descriptive statistics 
in order to explain the changes that occur after each 
adjustment operation. The design weights cannot be used in
our comparisons since no data on variables Y are available
for the initial sample. It is, however, illustrative to see that
they have the lowest relative variation, measured here with
1 + cv², in which cv is the coefficient of variation of the
weights. This formula is also used as an approximation of
the design effect (DEFF).
Rizzo et al. (1996) also use this indicator when comparing 
their weights. 
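
The indicator 1 + cv² can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the use of the population (divide-by-n) variance is an assumption, since the paper does not say which variance convention was applied.

```python
def one_plus_cv2(weights):
    """Return 1 + cv^2 for a list of weights.

    cv is the coefficient of variation of the weights. The variance
    uses the divide-by-n convention (an assumption), which makes the
    result equal to n * sum(w^2) / (sum(w))^2.
    """
    n = len(weights)
    mean = sum(weights) / n
    variance = sum((w - mean) ** 2 for w in weights) / n
    return 1.0 + variance / mean ** 2


# Equal weights give no inflation; unequal weights give a value above 1.
print(one_plus_cv2([300.0, 300.0, 300.0]))
print(one_plus_cv2([1.0, 3.0]))
```

With equal weights the indicator is exactly 1, and it grows as the weights become more dispersed, which is why it serves as a rough measure of the variance inflation caused by weighting.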

The changes are not dramatic in the first step, that is, 
from design weights to first-phase basic weights (except for 
the average that is related to decreasing counts), but in the 
following two steps the DEFFs are higher. We also see that 
the variation for both calibrated weights is lower than that 
for the respective response propensity-based weights. The 
distribution for each weight is skewed to the right, least for
the design weights, naturally. It is somewhat surprising that
the skewness is the highest for the volunteer weights. More
details about the weight distributions and the differences
between the weights are presented in Figures 1 to 3.

Figure 1 illustrates well how some weights have
increased substantially due to the response propensity
modelling (Model 2b). It is possible to look in detail to see
which types of units lie behind the points with a high weight
increase. For example, behind the separate left-side points
with RP weights higher than 700 are persons who are not
healthy and do not engage much in outdoor activities but
are, nevertheless, still in the volunteer data file. Similarly,
we can find other interesting groups by using the results
from the model estimations. However, the majority of the
points are in the same area and, consequently, fewer changes
can be expected in the estimates there than in the area with
more substantial weight changes.
Table 4 Descriptive statistics for different sampling weights. RP = Response Propensity


Weight                                 Phase        Unit size   Average       Skewness      1 + cv²
Design Weight                          Zero         12,658      308           0.94          1.30
(Calibrated) Basic Weight              First        10,666      365           1.30          1.39
Calibrated Weight                      Volunteers   8,481       460           [illegible]   1.63
RP Weight, Model 2b                    Volunteers   8,481       460           4.60          [illegible]
Calibrated Weight                      Second       5,480       712           1.64          1.62
TwoStep RP Weight, Models 2b and 3b    Second       5,480       [illegible]   3.60          1.84
OneStep RP Weight, Model 4b            Second       5,480       [illegible]   2.56          1.80
Figure 1 Scatter plot between the two volunteer weights

Figure 2 Scatter plot for second phase respondents between the calibrated weight and the two-step response propensity based weight

Figure 3 Scatter plot for second phase respondents between the two alternative response propensity modelling weights


The dispersion in the Figure 2 scatter is somewhat
stronger than that in Figure 1 but the profile is similar.
Consequently, interesting sub-groups can be found behind
distinct points.

Finally, Figure 3 compares the two alternative second
phase weights with each other. This scatter differs
considerably from the previous two, since the relationship is
rather linear. The maximum values of the two-step weights
are higher than those of the one-step weights, but many
one-step weights are, however, clearly higher than the
corresponding two-step weights. For example, non-healthy
people who nevertheless engage in outdoor activities receive
relatively high one-step weights, with no clear age effect.
On the other hand, people in older age groups with few
outdoor activities receive relatively high two-step weights,
irrespective of health. Nevertheless, big differences in the
respective estimates are not expected, although one of these
two alternatives has to be chosen for use. Even if the simpler
one, that is, one-step weighting, were chosen, it would still
be useful to analyse both steps and their response
propensities separately in order to understand better the
reasons for both types of missingness.


4.3 Comparison of parameter estimates 


We have not been able to make complete simulation 
studies with different assumptions in order to analyse which 
type of method would be best in each particular case. 
Fortunately, we can get quite close to this by comparing the 
effects on the estimates from three different perspectives. 
First, we have prepared the response/voluntariness models 
by using both X and some Y₁ variables. Consequently, we
know the ‘best’ parameter estimates based on these Y₁
values from the first survey. Second, we add auxiliary
variables Y₁ in the model but exclude some Y₁ values from
them. However, we know the ‘best’ values in these cases
and thus can make exact comparisons. Third, we can 
compare some estimates that are not known in any way. In 
this last case, we can only deduce which values might be the
best. 

We present our explicit results based on the variables 
described in Section 2. Note that we do not consider it 
important to present standard errors for each estimate 
because we are concentrating on the biases in these 
estimates. However, it is good to notice that the standard 
errors are around 0.2-0.4 percentage points for the first- 
phase data set and around 0.3-0.5 percentage points for the 
second-phase data set (lowest always in Health, second in 
Jogging, and highest in Outdoor and Skiing). 

Figure 4 gives the results based on the weights without
any adjustment (which, unfortunately, is often the case in
practice). We see that the bias is substantial in most
estimates; it is lowest in Jogging, which was not very actively
practised when compared to Outdoor, for example. In
general, most users are unhappy with such big biases, which
are statistically significant, mostly highly so, except for
Jogging in the second phase (e.g., the 95% confidence
interval of the bias for Health is from 1.7 points to 2.3
points). Here, as in later results, the bias means
over-estimation, so that as missingness increases the estimate
becomes too high. The results without good adjustments
will be too optimistic, that is, people seem to do ‘too much’
of all exercises. Note that the same tendency is obviously
also present in the first-phase estimates, but we cannot
verify this. There are surprising differences between the two
sets of estimates: sometimes the ‘volunteer’ data give the
more biased result, sometimes the second-phase respondents’
data do. We do not interpret these in detail but naturally
they reflect differences in missingness, and can be
considered to be warnings for a user.

For comparison, we show again in Figure 5 the same 
unadjusted results for volunteers as in Figure 4 but we have 
added the corresponding estimates based on post-stratified 
calibration and response propensity modelling. This graph 
clearly shows that post-stratification gives some benefit 
compared to the unadjusted solution. However, the response 
propensity method is the best in each case, and extremely 
good for Health and Outdoor that have been used as 
auxiliary variables in the supported models. 

Figures 6 and 7 concern the final-step estimates and are 
thus the most important. Figure 6 shows the same 
conclusion as Figure 5 in the sense that the response
propensity technique is superior to post-stratified calibration,
although not all differences are statistically highly
significant (especially Jogging). The difference between the
one-step method and the two-step method is fairly small and
the bias varies from one variable to the next. Hence, on the
basis of this study, we cannot say which of these two
specifications is better.
Figure 4 Bias in estimates in percentage points based on unadjusted sampling weights for second-phase respondents and for volunteers

Figure 5 Bias in estimates in percentage points for volunteers based on unadjusted sampling weights (symbol = ‘Initial’), post-stratified calibration and response propensity method in which variables Outdoor and Health have been used as auxiliary variables (Model 2b in Table 3)

Figure 6 Bias in estimates in percentage points for respondents after both steps based on post-stratification, and the two response propensity methods, i.e., ‘Two-step’ and ‘One-step’ so that variables Outdoor and Health have been used as auxiliary variables. The two-step method is based on the two consecutive models (Models 2b and 3b in Table 3) whereas the one-step method has been constructed direct to the second-phase respondents (Model 4b in Table 3)

Figure 7 presents some comparisons when the two new
variables have been added to the response propensity model.
The results are quite predictable, since this reduces the bias
in these estimates and, to some extent, in all other estimates
as well. The bias is still too large in Boating and Biking in
the opinion of many users, I suppose. We can reduce this
bias, naturally, by adding new auxiliary variables to the
model. How far could we go in this? This has not been
examined further in this study. On the other hand, we have
worse tools for reducing bias in variables that are based on
the second survey only. We tested several such estimates and
observed changes of the same level as in the cases of Boating
and Biking in Figure 7. In this case, however, we cannot
check the bias. We can only believe, on the basis of our
previous exercises, that these results are less biased than
those based on more poorly adjusted weights.


Figure 7 Bias in estimates in percentage points for respondents after both steps based on post-stratification, and the two ‘One-step’ response propensity methods so that variables Skiing and Fishing have also been used as auxiliary variables (Model 4c in Table 3). These are compared to those based on Model 4b


5. Discussion 


The problem discussed in this paper is common in 
surveys. There are many surveys which are conducted in 
more than one step, and some inconsistencies have occurred 
between these surveys due to missingness and other 
discrepancies. An internationally well-known example is 
the European Social Survey (ESS) that includes two 
questionnaires, a core one and a supplementary one. The 
number of respondents is naturally smaller for the latter than 
for the former. This leads to some selectiveness, for
example, responding to the second questionnaire being
positively associated with political activity. This is awkward
from the user’s point of view because an estimate based on a 
larger dataset differs from that based on a smaller one, 
although both concern the same variable and time period. 
Similarly to the ESS, this study concerns two-phase
surveyed data. The response rate in the second survey was 
substantially lower than in the ESS. The effect from 
selectiveness is also higher. Using the response propensity 
models we predicted this selectiveness and exploited the 
results in weighting adjustments, and as the final step we 
calibrated the sums of the weights to correspond to certain 
known population aggregates. This strategy aims at making 
the most of all available auxiliary information, derived both 
from registers and other external sources, and also of the 
previous phase of the survey at the micro level. 
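
The strategy summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the example propensities and the population totals are hypothetical, and the final calibration is shown as a simple ratio adjustment within post-strata, which is only one possible way of matching known population aggregates.

```python
from collections import defaultdict


def two_step_rp_weights(basic_weights, p_volunteer, p_response):
    """Divide each basic weight by both estimated propensities
    (volunteering, then responding given volunteering)."""
    return [w / (pv * pr)
            for w, pv, pr in zip(basic_weights, p_volunteer, p_response)]


def one_step_rp_weights(basic_weights, p_second_phase):
    """Divide each basic weight by a single second-phase propensity."""
    return [w / p for w, p in zip(basic_weights, p_second_phase)]


def calibrate(weights, strata, known_totals):
    """Scale weights within each stratum so they sum to a known total
    (ratio adjustment; an assumed, simplified form of calibration)."""
    sums = defaultdict(float)
    for w, s in zip(weights, strata):
        sums[s] += w
    return [w * known_totals[s] / sums[s] for w, s in zip(weights, strata)]


# Hypothetical respondents: basic weights, estimated propensities, strata.
basic = [365.0, 365.0, 365.0, 365.0]
p_vol = [0.9, 0.8, 0.5, 0.8]
p_resp = [0.7, 0.6, 0.7, 0.5]
strata = ["male", "female", "male", "female"]
totals = {"male": 1500.0, "female": 1600.0}

final = calibrate(two_step_rp_weights(basic, p_vol, p_resp), strata, totals)
print([round(w, 1) for w in final])
```

Respondents with a low estimated propensity receive larger weights, and the last step restores agreement with the known totals. The one-step variant replaces the product of two propensities with a single one estimated directly for second-phase response.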

In our example, the second phase of the survey 
comprised two different steps but only one data collection. 
The first step concerned willingness to participate 
voluntarily in the second phase of the survey, and the 
second step the actual survey participation of these 
volunteers, respectively. We examined both steps separately 
and found interesting information on their response 
mechanisms. Moreover, we used the results from this 
analysis for reweighting adjustments. For the sake of
comparison, we also treated both steps as a single step,
built a corresponding model, and continued the reweighting
analogously. Finally, we compared the estimates. It was
somewhat surprising that the two results differed so little
in our examples. This is, on the other hand, a good point,
since it is easier to work with one step, and hence this could
be introduced into use.

We thus propose a certain methodology for two-phase 
sampling weighting, but cannot say definitely which 
specification would be the best in each particular case. Our 
methodology is quite easy to exploit, but the advantages 
from it depend naturally largely on the availability of good 
external and internal auxiliary data. If no direct auxiliary 
variables are available, it will not be clear how good the 
adjusted estimates will be. Our examples show that these
can readily be less biased than the initial ones. However, our
recommended technique seems to be somewhat 
conservative so that all the best adjusted estimates in our 
analysis are slightly overestimated although not statistically 
significantly. This is an interesting question for future 
research that is still needed especially because this problem 
is becoming more common in the survey world. Another 
interesting topic for future research is how to make an 
optimal choice of auxiliary variables in the two-phase 
survey setting. 
Weighting in rotating samples:
The SILC survey in France


Pascal Ardilly and Pierre Lavallée


Abstract 


The European Union’s Statistics on Income and Living Conditions (SILC) survey was introduced in 2004 as a replacement 
for the European Panel. It produces annual statistics on income distribution, poverty and social exclusion. First conducted in 
France in May 2004, it is a longitudinal survey of all individuals over the age of 15 in 16,000 dwellings selected from the 
master sample and the new-housing sample frame. All respondents are tracked over time, even when they move to a 
different dwelling. The survey also has to produce cross-sectional estimates of good quality. 


To limit the response burden, the sample design recommended by Eurostat is a rotation scheme consisting of four panels
that remain in the sample for four years, with one panel replaced each year. France, however, decided to increase the panel
duration to nine years. The rotating sample design meets the survey’s longitudinal and cross-sectional requirements, but it
presents some weighting challenges.


Following a review of the inference context of a longitudinal survey, the paper discusses the longitudinal and cross-sectional 
weighting, which are designed to produce approximately unbiased estimators. 


Key Words: Longitudinal survey; Panel; Weight share method; Longitudinal weighting; Cross-sectional weighting. 


1. Introduction 


Statistics on Income and Living Conditions (SILC) is a 
European survey that produces data on the income and 
living conditions of persons living in regular households 
(persons living in communal households are excluded). It 
was introduced in 2004 as a replacement for the European 
Panel. While it is a European Union (EU) survey and
therefore under Eurostat responsibility, it is conducted
independently in each EU member state. Hence, the member
states (France in this case) are free to adjust the sample
design suggested by Eurostat to meet their national
requirements. The data are also processed by the individual
member states, as is usually the case for Eurostat surveys in 
the EU. This article deals only with the SILC survey 
conducted in France, but it may also be of interest to other 
EU member states. 

SILC is a longitudinal survey conducted once a year in 
May. It focuses on individuals rather than households, and 
data are collected through personal interviews with every 
person in the sampled dwellings. SILC can be thought of as
the European version of Statistics Canada’s Survey of
Labour and Income Dynamics (SLID) (see Lavallée 1995,
and Lévesque and Franklin 2000).

The SILC sample is rotating: each year, it is formed by
combining nine panel subsamples selected under identical
steady-state conditions, partly from the master sample and
partly from the new-housing sample frame. The master
sample and the new-housing sample frame are two dwelling
frames constructed from the French census of population
and the information and automated data processing system
for dwelling and office space (SITADEL) respectively (see
Ardilly 2006).

Each incoming panel includes all individuals living in the 
selected dwellings. Surveying all members of the 
households living in the selected dwellings makes it 
possible to produce both individual-level and household- 
level estimates and helps keep collection costs down by 
maximizing the number of individuals reached in each 
contact. On the other hand, some of the estimates are 
narrower in scope, applying only to the population aged 16 
and over on December 31 of the survey year. 

Each year, one subsample is rotated out and replaced 
with another subsample. In the survey’s starting year, 2004, 
each subsample consisted of 1,780 dwellings (give or take a 
few units because of rounding). In the second and 
subsequent years (i.e., from 2005 on), the size of the year’s 
incoming subsample was 3,000 dwellings. Note that at the 
outset in 2004, the sample was 16,000 dwellings, divided 
into nine equal parts. One of those parts was surveyed only 
once (in 2004), another twice (2004 and 2005), a third three 
times (2004, 2005 and 2006), and so on. After the start-up 
phase, a given panel will be surveyed for nine consecutive 
years. During the start-up phase, which will end in 2012 
with the departure of the ninth and last subsample from the 
2004 selection, the subsamples will have been surveyed 
fewer than nine times. 
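
The start-up and steady-state phases of the rotation described above can be traced with a short Python sketch. It is only an illustration under simplifying assumptions: the names are hypothetical, sizes are counted at selection (1,780 dwellings per 2004 part, 3,000 per later incoming subsample), and non-response, attrition and rounding are ignored.

```python
FIRST_YEAR = 2004
PANEL_LIFE = 9            # number of annual surveys per panel in steady state
INITIAL_PART_SIZE = 1780  # dwellings in each of the nine 2004 parts
INCOMING_SIZE = 3000      # dwellings in each incoming subsample from 2005 on


def active_subsamples(year):
    """Return (selection year, part) pairs surveyed in the given year.

    Part j of the 2004 selection is surveyed j times, in 2004 .. 2004 + j - 1.
    Later subsamples have part None and are surveyed for nine years.
    """
    if year < FIRST_YEAR:
        return []
    elapsed = year - FIRST_YEAR
    active = [(FIRST_YEAR, part)
              for part in range(1, PANEL_LIFE + 1) if part > elapsed]
    active += [(cohort, None)
               for cohort in range(FIRST_YEAR + 1, year + 1)
               if year - cohort < PANEL_LIFE]
    return active


def dwellings_at_selection(year):
    """Total dwellings selected for the subsamples active in a year."""
    return sum(INITIAL_PART_SIZE if cohort == FIRST_YEAR else INCOMING_SIZE
               for cohort, _ in active_subsamples(year))


for y in (2004, 2005, 2012, 2013):
    print(y, len(active_subsamples(y)), dwellings_at_selection(y))
```

Under these assumptions nine subsamples are active every year, the last part of the 2004 selection is surveyed for the final time in 2012, and from 2013 on every active subsample is a regular nine-year panel.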

The sampling procedure itself is the standard method of
selecting units from the master sample and the new-housing
sample frame (see Ardilly 2006). In this case, no
category of individuals is overrepresented. The survey has a
uniform sampling fraction (ignoring rounding) except for
vacant rural dwellings and dwellings that were secondary
residences in the 1999 census and became principal
residences by survey date, which are traditionally
under-represented.

Under the collection process, each subsample is 
considered a true panel of individuals. Panel members who 
move are tracked, and their files are sent to the appropriate 
regional branch of INSEE. More details on SILC’s sample 
design are available in the November 17, 2003, issue of the 
Official Journal of the European Union and internal INSEE 
documents describing sampling practices in France. 

Since SILC is a longitudinal survey with panels that 
overlap in time, weighting the sample presents a special 
problem. This paper provides a detailed picture of the two 
types of weighting used for SILC. We will begin by 
discussing some general principles related to SILC’s sample 
design. Then we will examine longitudinal weighting, 
followed by cross-sectional weighting. 

Note that we will not consider the topics of non-response 
correction and estimate adjustment. Those issues are dealt 
with in the same way as they are generally for any other 
longitudinal survey, such as the SLID (see Lavallée 1995, 
and Lévesque and Franklin 2000). 


2. General principles 


2.1. Two approaches: The longitudinal view and the 
cross-sectional view 


Each year, we have a sample of fully panelized 
individuals, eight ninths of whom were interviewed at least 
once in previous years (barring non-response). 

Two types of parameters may be of interest: annual totals
$Y_t$ (or their satellites), and changes in totals $\Delta_{t,t+1}$ between
two years, consecutive or otherwise. For simplicity, we will
confine ourselves to differences in totals between two
consecutive years. When discussing changes, we have to be
clear about the inference populations involved. We can look
at the data in two different ways: either as populations that
change over time (the cross-sectional approach) or as a
fixed population (the longitudinal approach). If we let $\Omega_t$ be
the entire in-scope population in year $t$, the annual total for
year $t$ is given by $Y_t = \sum_{i \in \Omega_t} Y_i^t$, where $Y_i^t$ is a variable of
interest measured for individual $i$. When we look at change,
we may want to estimate the difference $\Delta_{t,t+1}$ between the
total $Y_{t+1}$ at $t+1$ over $\Omega_{t+1}$ and the total $Y_t$ at $t$ over $\Omega_t$,
that is, $\Delta_{t,t+1} = Y_{t+1} - Y_t$. This is a cross-sectional view.
Alternatively, we may want to estimate the difference $\Delta_{t,t+1}$
between the totals for the units that are common to
populations $\Omega_{t+1}$ and $\Omega_t$, where the size difference
between the two populations is due to their incoming units
(births) and outgoing units (deaths). This is a longitudinal
view. Let $\Omega_{t,t+1} = \Omega_t \cap \Omega_{t+1}$, the population that is
common to $t$ and $t+1$. Then $\Delta_{t,t+1}$ is defined as

$$\Delta_{t,t+1} = \sum_{i \in \Omega_{t,t+1}} (Y_i^{t+1} - Y_i^t).$$


The two approaches are illustrated in the diagrams
below. The upper rectangle represents the entire population
at time $t$, and the lower one represents the entire population
at time $t+1$. The “minus” side represents deaths in the
broad sense (persons who have died, emigrated, moved to a
communal household, and so on), and the “plus” side
represents births in the broad sense (newborns, new
immigrants, persons who have become part of the survey
population by passing an age threshold, and so on). The
grey portion represents the inference population on each
date.
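
The difference between the two views can be made concrete with a toy Python example. The populations and values below are invented purely for illustration, and the function names are hypothetical; individual "c" leaves the population after year t (a death in the broad sense) and individual "d" enters it in year t+1 (a birth in the broad sense).

```python
def cross_sectional_change(y_t, y_t1):
    """Total over the population at t+1 minus total over the population at t."""
    return sum(y_t1.values()) - sum(y_t.values())


def longitudinal_change(y_t, y_t1):
    """Sum of individual changes over the units common to both populations."""
    common = y_t.keys() & y_t1.keys()
    return sum(y_t1[i] - y_t[i] for i in common)


# Invented values of the variable of interest, keyed by individual.
y_t = {"a": 10, "b": 20, "c": 30}    # population in year t
y_t1 = {"a": 12, "b": 19, "d": 5}    # population in year t+1

print(cross_sectional_change(y_t, y_t1))  # -24: includes births and deaths
print(longitudinal_change(y_t, y_t1))     # 1: common units "a" and "b" only
```

The two quantities differ exactly by the contribution of births and deaths, which is why the inference population has to be stated whenever a change is reported.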


2.2 Surveys repeated over time and potential 
strategies 


The goal, of course, is to produce both longitudinal
estimates and cross-sectional estimates. There are
essentially three possible strategies:


1. An “independent” sampling each year. In fact, 
because we have a master sample and a new- 
housing sample frame, the panels are selected from 
the same localities each year, and as a result, the 
subsamples are not truly independent. This solution 
can be improved for estimating changes. 


2. A fully panelized sampling, i.e., initial selection of 
a sample that is surveyed each year. This scenario 
presents a response burden problem, since the 
SILC survey is to continue indefinitely. It is 
therefore unrealistic. 


3. A rotational sample. This is the scenario that was 
chosen, because of its advantages in satisfying both 
longitudinal and cross-sectional goals. 


The table below characterizes the three potential sample
designs in terms of the two desired approaches.


Sample type                Cross-sectional approach             Longitudinal approach
“Independent” each year    CUSTOMARY                            POSSIBLE but less efficient
Panel                      IMPOSSIBLE without a top-up sample   CUSTOMARY
Rotational                 POSSIBLE                             POSSIBLE


The rotation strategy has four major advantages: 


i. It reduces the sampling error associated with 
measuring change (in principle, as do panels, 
though it is theoretically less efficient than a “pure” 
panel). 


ii. It has a smaller response burden than a “pure” 
panel. Under the circumstances, since France has a 
nine-year panel, this argument must be used with 
restraint. It is more persuasive in the scenario 
recommended by Eurostat, which consists of an 
annual survey for four consecutive years. 


iii. It takes into account very “naturally” how the
population changes over time. This point will 
become clearer when we look at the coverage of 
new populations. 


iv. It reduces observation errors (as do panels). 


On the other hand, the strategy also has at least three 
weaknesses: 


i. Participants have to be tracked over time, which 
results in tracing costs and non-response due to 
moves. 


ii. The length of the individual series is limited to nine
years, which is substantial, though not as
informative as a pure panel.


iii. The longitudinal/cross-sectional weighting method
is not straightforward.


3. Longitudinal weighting 


This type of weighting is inherently somewhat easier to 
understand than cross-sectional weighting because there is 
no need to take account of how the population changes over 
time (except for “deaths”, units that leave the survey 
population over time, but they are not much of a technical 
problem). The idea behind longitudinal estimation, of 
course, is to make an inference based on a single population 
at an initial date. 

Clearly, the rotational nature of the sample design is what
makes weighting difficult, since between two consecutive
years $t$ and $t+1$, we have to deal with eight different panels,
each selected from a different population (a population
made up of individuals who are, naturally, different from
year to year). If we were dealing with just one panel, we
would only need to use the sampling weights associated
with the panel members who were still in-scope on date $t$,
since those weights are calculated once and for all at the
time of selection and can be used to make inferences about
the initial population each year throughout the panel’s life.

The essential difficulty is to represent population $\Omega_t$ on
date $t$ using eight panel subsamples selected on different
dates and therefore from different populations. Intuitively, it
makes sense that a given individual would have a
probability of selection on date $t$ that would depend on the
number of panel subsamples for which he or she could be
chosen. For this discussion, it is assumed that there is no
non-response. This situation can be expressed formally by
letting $a_{t,k}$ = a panel subsample to be surveyed in year $t$ for
the $k$th time, and $s_{t,t+1} = \bigcup_{k=1}^{8} a_{t,k}$.


Note that we can write $a_{t+1,k+1} = a_{t,k}$ ($\forall t$, $\forall k \neq 9$) since we
are obliged to use each (non-outgoing) panel subsample in
its entirety year after year. This is pictured below.


[Diagram: panel subsamples surveyed in years $t$ and $t+1$ along the time axis]


The grey part represents $s_{t,t+1}$, which is the sample used
in this longitudinal approach. It is from the individuals in
$s_{t,t+1}$ that we obtain both $Y_i^t$ and $Y_i^{t+1}$, i.e., information
about individual $i$ on dates $t$ and $t+1$ respectively.

Suppose we have an individual $i$ in $\Omega_t$ who is in-scope
on date $t$. We denote as $L_i$ the number of years in
$\{t-7, t-6, \ldots, t-1, t\}$ during which individual $i$ was
in-scope and therefore had a chance of being selected as a
member of an incoming panel. It is assumed here that each
year, the sample frame covers the survey population exactly.
We have $L_i \in \{1, 2, 3, \ldots, 8\}$. In addition, we denote as $K_i$
the set of $k$ indexes out of 1, 2, 3, ..., 8 for which $i \in a_{t,k}$.
These are the numbers of the panels of which individual $i$ is
a member on date $t$. For all $i$ in $s_{t,t+1}$, $K_i$ will be construed
as a set containing at least one element. Most of the time,
$K_i$ will in fact have only one index, but in some cases, it
will have two or even more indexes. That will be the case if
$i$ is selected for a panel, he/she moves and his/her new
dwelling is chosen for another panel in a subsequent year.
Note that our scenario excludes the possibility of selecting a
given dwelling twice, since dwellings from the master
sample and the new-housing sample frame are not supposed
to be surveyed more than once. This is just a practical
convention, however, as the theory can easily accommodate
a system in which dwellings can be selected multiple times.

If $i \in a_{t,k}$, let $W_i(t,k)$ be his/her “raw” sampling weight.
In fact, it is the sampling weight of the dwelling in which $i$
was living at the time he/she was chosen as a panel member,
i.e., at the time of the annual selection from $\Omega_{t-k+1}$. This
weighting system allows direct inference from subsample
$a_{t,k}$ to the entire population $\Omega_{t-k+1}$. In particular,
$\sum_{i \in a_{t,k}} W_i(t,k)$ provides an unbiased estimate of the total
number of in-scope individuals who are members of
population $\Omega_{t-k+1}$. For SILC in France, that total is roughly
60 million. The longitudinal weight assigned to each
individual $i$ in $s_{t,t+1}$ will therefore be as follows:


$$W_i^{t,t+1} = \frac{1}{L_i} \sum_{k \in K_i} W_i(t,k). \qquad (1)$$
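As a minimal sketch of the longitudinal weight share described above (the sum of an individual's raw weights over the panels in $K_i$, divided by the number of links $L_i$), the following Python function may help. The function name, argument names and the numeric weights are hypothetical and chosen only for illustration.

```python
from typing import Mapping


def longitudinal_weight(raw_weights: Mapping[int, float], n_links: int) -> float:
    """Sum of the raw weights W_i(t, k) over k in K_i, divided by L_i."""
    if not raw_weights:
        raise ValueError("K_i must contain at least one panel index")
    if not 1 <= n_links <= 8:
        raise ValueError("L_i must lie in {1, ..., 8}")
    return sum(raw_weights.values()) / n_links


# Hypothetical individual selected once (panel k = 3), in scope for all 8 years.
w_single = longitudinal_weight({3: 2400.0}, 8)
# Hypothetical individual selected twice after a move, in scope for 5 years.
w_double = longitudinal_weight({1: 2000.0, 4: 3000.0}, 5)
print(w_single, w_double)
```

With a single selection and eight links, the raw weight is simply divided by 8, which is the simplified case discussed in the text that follows.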


This equation is derived from the application of the
weight share method (see Lavallée 1995, and Lavallée
2002) in which the initial population (the population of
sampling units) is defined as the union of the populations
$\Omega_{t-7}, \ldots, \Omega_{t-1}, \Omega_t$ and the final population (the
population of observation units) as $\Omega_t$. This is illustrated in the
diagram below; for greater clarity, only three of the initial
subpopulations are shown. Clearly, the number of links is
equal to $L_i$ (in this case, $i$ has exactly eight links, while $j$
must have fewer than eight because it does not appear in the
oldest sample frames). In practice, it is realistic to proceed
as if $\Omega_{t-7} \subset \Omega_{t-6} \subset \cdots \subset \Omega_t$. We can work with
nested populations, since all individuals who leave the
survey population in the time before $t$ will not be part of
$s_{t,t+1}$.

[Diagram: links from three of the initial populations to the final population $\Omega_t$ for two individuals $i$ and $j$.]
Equation (1) is the most general formula for the “raw”
longitudinal weight. It can then be simplified for specific
situations. For example, if we ignore the cases in which a
panel member can be selected more than once, we have


$$W_i^{t,t+1} = \frac{W_i}{L_i}, \qquad (2)$$


where $W_i$ is the weight of $i$ relative to the one panel
subsample of which he/she was a member on date $t$. In
France’s case, because of the sample sizes involved, it
seems quite appropriate to use that equation. If we assume
that we are in the ideal position (though that seems
simplistic in our circumstances) of having a population that
does not change over time, we will have $L_i = 8$ for all $i$.
The population changes a great deal in nine years, but with
shorter panel lives, the ideal case might be an acceptable
approximation. Moreover, if the panels are selected with
equal probabilities, $W_i$ will be equal to a constant $W$ and we
will have


$$W_i^{t,t+1} = \frac{W}{8}. \qquad (3)$$

Such a scenario is highly improbable in France’s case. 
First, up to 2012, the sample will contain subsamples with 
very different raw weights. Second, the sampling process is 
likely to focus on generating a predetermined number of 
dwellings (as the total number of dwellings increases), and 
not a constant sampling fraction. 

Note that equation (3) is intuitive. Ultimately, everything
proceeds “as if” any individual in the longitudinal sample
$s_{t,t+1}$ had a selection probability eight times the selection
probability of each panel subsample $a_{t,k}$ that is part of it.

The foregoing applies to the survey in its steady state and
must be adapted slightly during the start-up phase, i.e., until
2012. The first longitudinal operation is performed on the
combined 2004-2005 data, to estimate the changes between
2004 and 2005 with the 2004 reference population (from
which the “deaths” are removed in 2005). In this case, we
need only to divide all the weights $W_i$ of the eight
subsamples $a_{2004,1}$ to $a_{2004,8}$ by 8; in other words, $L_i = 8$
for all $i$. In 2006, when we look at the 2005-2006 changes,
the denominator $L_i$ may take only two values. In the first
scenario, panel member $i$ was in the sample frame used in
2004 (and hence could have been selected in 2004) and so
$L_i = 8$. This is due to the fact that everything proceeds as if,
in 2004, the seven selection processes for panels $a_{2005,2}$ to
$a_{2005,8}$ had been carried out under exactly the same
conditions. In the second scenario, individual $i$ was not in
the 2004 sample frame (but is in the 2005 frame and is
necessarily in $a_{2005,1}$) and $L_i = 1$. For the 2006-2007
changes, $L_i$ can be equal to 1, 2 or 8, and so on. We will not
have the set of all possible values of $L_i$ in $\{1, 2, 3, \ldots, 8\}$
until we measure the 2011-2012 changes.
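The start-up values of $L_i$ can be checked with a short sketch. It assumes, as the text does, that every panel which would have been drawn before 2004 in steady state was in fact drawn in 2004, and that an individual who enters the frame stays in it (nested populations). The function names are hypothetical.

```python
from typing import List

FIRST_YEAR = 2004  # first collection year, as stated in the text


def selection_year(t: int, k: int) -> int:
    """Year whose frame was used to draw the panel surveyed for the k-th time in year t."""
    return max(FIRST_YEAR, t - k + 1)


def links(t: int, first_frame_year: int) -> int:
    """L_i for the t -> t+1 change: number of the eight panels k = 1..8
    drawn from a frame that already contained the individual."""
    return sum(1 for k in range(1, 9) if selection_year(t, k) >= first_frame_year)


def possible_values(t: int) -> List[int]:
    """All values L_i can take for the t -> t+1 change."""
    return sorted({links(t, y) for y in range(FIRST_YEAR, t + 1)})


for year in (2004, 2005, 2006, 2011):
    print(year, possible_values(year))
```

Under these assumptions the sketch gives {8} for the 2004-2005 changes, {1, 8} for 2005-2006, {1, 2, 8} for 2006-2007, and the full set {1, ..., 8} only from the 2011-2012 changes onward.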

Once we reach this point in the longitudinal weighting
process, we can calculate the longitudinal weights $W_i^{t,t+1}$
and then derive the estimator of the difference $\Delta_{t,t+1}$ using


$$\hat{\Delta}_{t,t+1} = \sum_{i \in s_{t,t+1}} W_i^{t,t+1} \left( Y_i^{t+1} - Y_i^{t} \right). \qquad (4)$$


Logically, the weights $W_i^{t,t+1}$ are used only to estimate
change. They are of no value for point estimates because the
inference population has little meaning on a particular date.
Note that up to this point, the $W_i^{t,t+1}$ have not been
corrected for non-response or adjusted in any other way. In
practice, equation (4) will be subject to adjustments in the
case of the SILC survey.

Estimation of the difference $\Delta_{t,t+1} = Y_{t+1} - Y_t$ is a cross-
sectional matter and therefore involves the weighting
process described in the next section.


4. Cross-sectional weighting 


The aim is to make an inference about the total in-scope
population $\Omega_t$ on the current date $t$. The essential difficulty
lies in the fact that in theory, a given (panelized) subsample 
provides adequate coverage of the population only in the 
year in which it was selected. After that year, the panel 
subsample no longer represents the new population of 
“births”, the units that become in-scope. That is the case for 
newborns, immigrants, individuals who reach specific age 
thresholds, homeless people who start living in a regular 
dwelling, people who leave communal dwellings, and so on. 
While in practice we might consider this coverage defect 
acceptable for a period of time, it very quickly becomes a 
serious problem (that is true each year for most panel 
subsamples), and a top-up sample must be obtained in some 
fashion. It is worth noting that the problem of population
change over time is highly asymmetrical, since the
subpopulation that disappears from year to year (the
“deaths”) presents no particular difficulties for weighting.

In the SILC survey, the top-up sample is obtained as 
follows. We survey all individuals in the household of each 
panel member interviewed in the longitudinal tracking 
process. Thus, every household surveyed in the cross- 
sectional process is made up of two types of people: panel 
members and cohabitants (people who are surveyed but are 
not panel members). This method covers a large portion of 
the “births” (in the broad sense) in the population. However, 
it does not cover households consisting entirely of “births”, 
such as households of new immigrants. “Birth” status is 
usually determined by asking the birth date of newborns and 
the landing date of immigrants. Moreover, in practice, the 
weakness in births coverage is generally regarded as very
minor because it is partially corrected with adjustments.

The main technique used to produce cross-sectional
weights is the weight share method (Lavallée 2002). As
noted previously, in year $t$ we have nine panel subsamples
$a_{t,k}$ ($1 \le k \le 9$). We will describe two different ways of
using the weight share method. The information that must
be collected in the questionnaire is the same for both
methods.


4.1 Method 1 


The more rigorous approach involves linking all nine
subsamples $a_{t,k}$ to the cross-sectional sample for year $t$,
which we will denote $u_t$ (Merkouris 2001). In other words,
the sample $u_t$ is the same as $s_t = \bigcup_{k=1}^{9} a_{t,k}$. First, we must
determine the links associated with this approach. When a
panel member in one of the nine subsamples $a_{t,k}$ is
selected, he/she points to himself/herself as a member of the
cross-sectional sample at $t$ (similar to what is shown in the
diagram in 3.1). Under these conditions, when the survey is
in steady-state mode, the cross-sectional weight $W_i^{(1)}$ of an
individual $i$ in $u_t$ is calculated as shown below. The
household of which $i$ is a member is denoted $m$. We have


$$W_i^{(1)} = \frac{\sum_{k=1}^{9} \sum_{j \in m,\; j \in a_{t,k}} W_j(t,k)}{\sum_{k=1}^{9} \sum_{j \in m,\; j \in \Omega_{t-k+1}} 1}, \qquad (5)$$


where $W_j(t,k)$ is the sampling weight from sample $a_{t,k}$.

This expression shows that all members of the same 
household ultimately have the same weight. In the 
numerator, we have the sum of all the “raw” weights (the 
sampling weights) of the household’s panel members. It is 
understood that a panel member generally appears in only 
one subsample, but that there may be cases in which a panel 
member is selected two or more times over a period of nine 
consecutive years (usually because he/she has moved). Note 
that dwellings selected from the master sample and the new- 
housing sample frame are not supposed to be selected again 
and therefore, in the case of SILC, the probability that an 
individual who has not moved will appear in two different 
panels is zero. 
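The Method 1 weight share can be sketched as follows, following the description in this section: the numerator adds up the raw weights of the household's panel members over all nine subsamples, and the denominator counts, for each of the nine selection years, the household members who were in that year's sampling frame. The `Member` class, its field names and the numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Member:
    # Raw weights W_j(t, k) keyed by panel index k; empty for a cohabitant.
    raw_weights: Dict[int, float] = field(default_factory=dict)
    # Panel indexes k (1..9) whose sampling frame contained this person;
    # hypothetical field, to be filled from the questionnaire.
    frames: List[int] = field(default_factory=list)


def method1_weight(household: List[Member]) -> float:
    """Common cross-sectional weight of every member of the household."""
    numerator = sum(w for m in household for w in m.raw_weights.values())
    denominator = sum(len(m.frames) for m in household)
    if denominator == 0:
        raise ValueError("household has no link to any sampling frame")
    return numerator / denominator


all_frames = list(range(1, 10))
household = [
    Member(raw_weights={4: 2700.0}, frames=all_frames),  # panel member
    Member(frames=all_frames),                           # adult cohabitant
    Member(frames=[1, 2]),                               # child born two years ago
]
print(method1_weight(household))
```

In this illustration the denominator is 9 + 9 + 2 = 20 links, so each of the three household members receives the weight 2700 / 20 = 135.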

As in the longitudinal case (see 3.1), weighting can be
carried out only if the data management system is capable of
linking each panel member in $u_t$ to all panel samples $a_{t,k}$ in
which he/she is included. In the denominator, for each of the
nine years $t-8$ to $t$ considered, we count the household
members (both panel members and cohabitants) who are in
the sample frame from which the incoming panel subsample
for the year in question is selected. This calculation clearly
requires the information provided by the questionnaire.

There are two advantages to this approach: it is
completely general, and it produces unbiased cross-sectional
weights directly because every cross-sectional household is
necessarily linked to one of the nine subsamples involved.
The fact that there is an incoming subsample each year
ensures the completeness of the cross-sectional population
$\Omega_t$; that is, in more technical terms, it ensures that there is
at least one link for each household considered at $t$. This is a
useful property of rotational sampling, as discussed in
section 2.2. On the other hand, the weighting formula has a
disadvantage, which is its (relative) complexity both in
theoretical terms and for computer programming purposes.

In the start-up phase (up to and including 2011), the
formula must be adjusted. The numerator remains the same,
but the denominator covers all individuals who could be
sampled in 2004 (the survey’s first year) and subsequent
years. In 2004, weighting is trivial since there is no weight
share, but in 2005, we have


$$W_i^{(1)} = \frac{\sum_{k=1}^{9} \sum_{j \in m,\; j \in a_{2005,k}} W_j(2005,k)}{8 \sum_{j \in m,\; j \in \Omega_{2004}} 1 + \sum_{j \in m,\; j \in \Omega_{2005}} 1}. \qquad (6)$$

In 2006, the formula will be


4.2 Method 2


It is possible to take an alternative approach to cross-
sectional weighting, one that leads to a “slightly” simpler
equation and is easier to program, but one that presents a
difficulty that was not present in the previous method and
may make the final weights somewhat less precise. The idea
is to use one subsample at a time rather than all of them at
once. We take one of the nine subsamples $a_{t,k}$ and the
sample of households to which it leads. We then apply the
weight share, which when the survey is in steady-state mode
yields an individual weight, denoted here $\tilde{W}_i(t,k)$, equal to


$$\tilde{W}_i(t,k) = \frac{\sum_{j \in m,\; j \in a_{t,k}} W_j(t,k)}{\sum_{j \in m,\; j \in \Omega_{t-k+1}} 1} \qquad (8)$$


for any individual $i$ in household $m$. It is very easy to verify
that if $k = 1$ (which is the case for the incoming subsample),
$\tilde{W}_i(t,1)$ is the sampling weight of household $m$.
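A minimal sketch of the single-panel weight share may make the $k = 1$ check concrete. It assumes that the shared weight is the sum of the raw weights of the household's members in the panel considered, divided by the number of household members who were in the frame from which that panel was drawn; the function name and the numbers are hypothetical.

```python
from typing import Sequence


def panel_weight_share(raw_weights: Sequence[float], n_in_frame: int) -> float:
    """Weight shared by all members of a household reached through one panel.

    raw_weights: raw weights of the household members belonging to the panel.
    n_in_frame: household members present in the frame used to draw the panel.
    """
    if n_in_frame < 1:
        raise ValueError("at least one household member must be in the frame")
    return sum(raw_weights) / n_in_frame


# Incoming panel (k = 1): all three occupants of the selected dwelling are
# panel members and carry the dwelling weight 1500.
incoming = panel_weight_share([1500.0, 1500.0, 1500.0], 3)
# Older panel: one panel member (weight 2000) now lives with four other people,
# three of whom were already in the frame when the panel was drawn.
older = panel_weight_share([2000.0], 4)
print(incoming, older)
```

For the incoming panel the shared weight equals the dwelling's sampling weight, as the text states; for the older panel the single raw weight is spread over the four members who could have been selected.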

The problem with this approach lies in the existence
(a priori) on date $t$ of individuals who cannot be surveyed
because they belong to households that cannot be “reached”
through the sampling of $a_{t,k}$ (as long as $k \ge 2$), i.e.,
individuals whose probability of being surveyed at $t$ is zero.
This impediment did not exist in the previous method
because taking all the subsamples into account at once
ensured that on date $t$, every household had a non-zero
probability of being selected, at least through $a_{t,1}$. This
illustrates once again one of the key advantages of rotational
sampling, which is that it covers the entire population each
year. In our approach, it is clear that if we consider $a_{t,k}$ for
$k \ge 2$, the population of households consisting exclusively
of “immigrants” (in the broad sense) between $t-k+1$ and $t$
is not covered. To formalize the situation and produce the
final cross-sectional weight, we will use $\Omega_t^{\mathrm{immig},\alpha}$ to denote
the population of “immigrants” (in the broad sense) on date
$t$ in households that consist only of immigrants who can be
sampled after year $\alpha$, with $t-8 \le \alpha \le t-1$. To be more
precise, we should say “who can be sampled on or after a
date that is strictly subsequent to the collection date in year
$\alpha$.”

On date $t$, the entire population $\Omega_t$ is partitioned into
nine components: the eight subpopulations $\Omega_t^{\mathrm{immig},\alpha}$ with $\alpha$
ranging from $t-8$ to $t-1$, and the subpopulation consisting
of individuals who either were already surveyable at $t-8$ or
became surveyable on a date subsequent to $t-8$ (i.e., who
immigrated after $t-8$) but at $t$ are members of a household
containing at least one person who is surveyable at $t-8$. We
consider that if the household at $t$ contains at least one
person who is surveyable at $t-8$, that will be the case on
any date between $t-8$ and $t-1$. This ignores situations in
which an individual who is in-scope on a given date
becomes out-of-scope for a time (as a result of emigration,
for example) and then becomes in-scope again.

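Next, we use $u_{t,k}$ to denote the cross-sectional sample at
$t$ from panel $a_{t,k}$, which leads to $\bigcup_k u_{t,k} = u_t$. Let
$Y_t^{\mathrm{immig},\alpha}$ be the total of the $Y_i^t$ defined on
$\Omega_t^{\mathrm{immig},\alpha}$. Following the weight share performed for all
$k = 2, \ldots, 9$, with $\tilde{W}_i(t,k)$ denoting the resulting weight, we have


$$E\Big( \sum_{i \in u_{t,k}} \tilde{W}_i(t,k)\, Y_i^t \Big) = Y_t - \sum_{\alpha = t-k+1}^{t-1} Y_t^{\mathrm{immig},\alpha} \qquad (9)$$

and


$$E\Big( \sum_{i \in u_{t,1}} \tilde{W}_i(t,1)\, Y_i^t \Big) = Y_t \qquad (10)$$

since $u_{t,1} = a_{t,1}$.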

If we were using shorter-duration panels, we might be
able to ignore the $Y_t^{\mathrm{immig},\alpha}$ and take the actual total over $\Omega_t$.
In that case, the “raw” final cross-sectional weight of any
individual $i$ would be $\tilde{W}_i(t,k)/9$ if $i$ is from $a_{t,k}$, which
would yield the final estimator


$$\hat{Y}_t = \sum_{k=1}^{9} \sum_{i \in u_{t,k}} \frac{\tilde{W}_i(t,k)}{9}\, Y_i^t. \qquad (11)$$

However, since the panels used in France have long
lives, we will probably not be able to ignore the $Y_t^{\mathrm{immig},\alpha}$ (an
analysis of the collection files will provide the answer),
which will mean having to compute specific weights for the
individuals in $\Omega_t^{\mathrm{immig},\alpha}$. In those circumstances, we check that
any individual $i$ in $\Omega_t^{\mathrm{immig},\alpha}$ who ends up in the cross-
sectional sample $u_t$ will have a raw cross-sectional weight
$W_i^{(2)}$ equal to the weight share value $\tilde{W}_i(t,k)$ divided by
$t-\alpha$ (and therefore $1 \le t-\alpha \le 8$). Any individual in $\Omega_t$
who does not belong to any of the $\Omega_t^{\mathrm{immig},\alpha}$ (i.e., the vast
majority of individuals) will have a final weight of
$\tilde{W}_i(t,k)/9$. Note that if $i$ is in $\Omega_t^{\mathrm{immig},\alpha}$, he/she can be
surveyed only through $a_{t,1}, a_{t,2}, \ldots, a_{t,t-\alpha}$. Thus we have


$$W_i^{(2)} = \begin{cases} \tilde{W}_i(t,k)/(t-\alpha) & \text{if } i \in \Omega_t^{\mathrm{immig},\alpha} \\ \tilde{W}_i(t,k)/9 & \text{otherwise.} \end{cases} \qquad (12)$$
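The final Method 2 weight in steady state can be sketched as below. The sketch assumes the rule stated in the text: the weight share value is divided by $t - \alpha$ for members of a household made up only of "immigrants" first surveyable after year $\alpha$, and by 9 for everyone else. The function and argument names are hypothetical.

```python
from typing import Optional

N_PANELS = 9  # number of panel subsamples surveyed each year in steady state


def method2_final_weight(shared_weight: float, t: int,
                         alpha: Optional[int] = None) -> float:
    """Divide the weight share value by the number of panels that can reach the household.

    alpha is None for households containing at least one person who was
    already surveyable at t - 8; otherwise it is the year after which all
    members of the household became surveyable.
    """
    if alpha is None:
        return shared_weight / N_PANELS
    if not t - 8 <= alpha <= t - 1:
        raise ValueError("alpha must satisfy t - 8 <= alpha <= t - 1")
    return shared_weight / (t - alpha)


print(method2_final_weight(1800.0, 2015))              # long-standing household
print(method2_final_weight(1800.0, 2015, alpha=2012))  # reachable by 3 panels
print(method2_final_weight(1800.0, 2015, alpha=2014))  # incoming panel only
```

A household reachable only through the incoming panel keeps its full shared weight, since $t - \alpha = 1$.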


In the start-up phase, the weighting process has to be
adjusted. In 2005, the final cross-sectional weight of
individuals in $\Omega_{2005}^{\mathrm{immig},2004}$ will come directly from the
selection of the dwelling from $a_{2005,1}$ (they can only be
reached through this incoming panel). In contrast, all other
individuals can be surveyed “normally” in the nine panels
$a_{2005,k}$ ($1 \le k \le 9$), so that their weights as calculated by the
weight share method will all be divided by 9. In 2006, the
weights of the individuals in $\Omega_{2006}^{\mathrm{immig},2005}$ will be equal to the
weight of the dwelling in which they live, a weight that
directly reflects the sampling from $\Omega_{2006}$; the weights of
the individuals in $\Omega_{2006}^{\mathrm{immig},2004}$ will be the weights from the
weight share divided by 2; and the weights of all other
individuals will be the weights from the weight share
divided by 9.

This procedure can be carried out for one subsample after
another and does not have to take account of what happens
in other subsamples. If an individual is surveyed at $t$ through
two (or more) different subsamples $a_{t,k}$, we carry out the
full procedure for each of the two (or more) subsamples.
This could occur, for example, in the case of a household
composed of two panel members from two different
subsamples $a_{t,k}$ who married each other and before their
marriage were each tracked separately as one-person
households. In that scenario, each individual would be
“formally” surveyed twice, once as a panel member and
once as a cohabitant.


* 


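Finally, to estimate the difference $\Delta_{t,t+1} = Y_{t+1} - Y_t$, we
can use the weights $W_i^{(1)}$ from method 1 and calculate

$$\hat{\Delta}_{t,t+1} = \sum_{i \in u_{t+1}} W_i^{(1)}\, Y_i^{t+1} - \sum_{i \in u_t} W_i^{(1)}\, Y_i^{t}. \qquad (13)$$


Alternatively, we can use the weights $W_i^{(2)}$ from method
2. In that case, the estimator of the difference $\Delta_{t,t+1}$ will be
given by


$$\hat{\Delta}_{t,t+1} = \sum_{i \in u_{t+1}} W_i^{(2)}\, Y_i^{t+1} - \sum_{i \in u_t} W_i^{(2)}\, Y_i^{t}. \qquad (14)$$

In each sum, the weight is the one computed for the year of the
corresponding cross-sectional sample.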
Cell collapsing in poststratification


Jay J. Kim, Jianzhu Li and Richard Valliant ¹


Abstract 


Poststratification is a common method of estimation in household surveys. Cells are formed based on characteristics that 
are known for all sample respondents and for which external control counts are available from a census or another source. 
The inverses of the poststratification adjustments are usually referred to as coverage ratios. Coverage of some demographic 
groups may be substantially below 100 percent, and poststratifying serves to correct for biases due to poor coverage. A 
standard procedure in poststratification is to collapse or combine cells when the sample sizes fall below some minimum or 
the weight adjustments are above some maximum. Collapsing can either increase or decrease the variance of an estimate but 
may simultaneously increase its bias. We study the effects on bias and variance of this type of dynamic cell collapsing 
theoretically and through simulation using a population based on the 2003 National Health Interview Survey. Two 
alternative estimators are also proposed that restrict the size of weight adjustments when cells are collapsed. 


Key Words: Bias; Combining cells; Coverage error; Poststratification; Under-coverage; Weight trimming. 


1. Introduction 


Poststratification is a common technique used in survey 
weighting that can serve to (1) reduce variances or (2) adjust 
for deficient coverage by the sample of some groups in the 
target population. In household surveys in the U.S. the 
second purpose is especially important because some 
demographic groups, like young Black males, are covered 
less well than others (e.g., see Kostanich and Dippo 2000, 
chapter 16). Adjusting for undercoverage can lead to 
differential weights, which may correct for bias but will also 
increase standard errors. Practitioners often avoid making 
extreme weight adjustments, in effect trading-off some bias 
reduction in order to keep variances under control. 

One method of controlling the size of weight adjustments 
is to collapse the initial poststratification cells together if the 
adjustment in a cell exceeds some limit. Little (1993) and 
Lazzeroni and Little (1998) cover methods of collapsing 
categories of ordinal poststratifiers. Other strategies for how 
to collapse strata or construct estimators have been 
suggested by Fuller (1966), Kalton and Maligalig (1991), 
and Tremblay (1986). Kim, Thompson, Woltman, and Vajs 
(1982) give some practical applications. In this paper, we 
study the effects on bias and variance of combining cells, 
assuming that more finely defined cells would be preferable 
if the sample sizes and sizes of weight adjustments were 
within some tolerances set by the survey designers. 

Two criteria are often used to decide whether a cell
should be collapsed with another. The first is the inverse
coverage ratio or initial adjustment factor (IAF), and is
defined as the ratio of the control count to the initially
weighted sample count for the cell. A ratio which is
significantly different from 1 indicates that coverage is
either low or high for the group represented by the cell.
When the IAF for a cell falls outside some bounds set in
advance, the cell is combined with another. For example,
the collapsing threshold for a “high” ratio might be 2 and the
threshold for a “low” ratio 0.6, which are the bounds used in
the Current Population Survey (CPS) conducted by the U.S.
Bureau of the Census (see Kostanich and Dippo 2000, page
10-7). The second criterion is the sample size. A cell whose
raw sample count is too small may be collapsed on the
grounds that the IAF is unstable. We will refer to a cell as
sparse if it violates one or the other of the criteria and is
collapsed with another cell.
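The two criteria can be written as a small check. The IAF bounds of 0.6 and 2 are the CPS bounds quoted above; the minimum sample size of 20, the `Cell` class and the example counts are hypothetical values used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    control_count: float   # external control total for the cell
    weighted_count: float  # initially weighted sample count
    sample_size: int       # raw number of respondents


def iaf(cell: Cell) -> float:
    """Initial adjustment factor: control count over weighted sample count."""
    return cell.control_count / cell.weighted_count


def is_sparse(cell: Cell, low: float = 0.6, high: float = 2.0,
              min_n: int = 20) -> bool:
    """True if the IAF is out of bounds or the raw sample is too small."""
    factor = iaf(cell)
    return factor < low or factor > high or cell.sample_size < min_n


cells = [
    Cell("A", 1000.0, 900.0, 50),   # IAF about 1.11, large enough sample
    Cell("B", 1000.0, 400.0, 50),   # IAF 2.5, above the upper bound
    Cell("C", 1000.0, 950.0, 5),    # acceptable IAF but only 5 respondents
    Cell("D", 500.0, 1000.0, 100),  # IAF 0.5, below the lower bound
]
flags = [is_sparse(c) for c in cells]
print(flags)
```

Only cell A passes both criteria in this illustration; the other three would be candidates for collapsing.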

The categories of the variables that define poststrata are 
usually sorted based on a natural ordering (e.g., age or 
income categories) or a convenient ordering (e.g., race- 
ethnicity). Common practice is to collapse a cell with an 
adjacent one which is similar in characteristics, disregarding 
different coverage ratios of the individual cells. 

Kalton and Flores-Cervantes (2003, page 95) observed 
that “methods that automatically restrict the range of the 
adjustments are redistributing the excess adjustments that 
would otherwise be given to some respondents to other 
respondents. The appropriateness of this redistribution 
should be examined.” This paper indeed examines its 
appropriateness and identifies circumstances where the 
weight redistribution due to collapsing may be quite 
harmful. 

An obvious weakness of popular collapsing strategies is
that coverage bias for some groups will be incompletely
corrected. For example, suppose that the survey estimate for
the number of units in a group is only 1/3 of the census
count, so that initial weights would have to be multiplied by
3 to correct for undercoverage. If cell collapsing restricts the
weight adjustment for units in that group to a factor of 2,
then the survey estimate for the number of units in the group
will be only 2/3 of the census count. In addition, if cells with
much different means are combined, bias can be introduced
rather than corrected. The incomplete correction for
undercoverage and collapsing of cells with disparate
characteristics may lead to bias in totals, means, and other
types of estimates.


1. Jay J. Kim, National Center for Health Statistics, Centers for Disease Control and Prevention; Jianzhu Li, Joint Program in Survey Methodology,
University of Maryland; Richard Valliant, Survey Research Center, University of Michigan.

Table 1 gives some illustrative coverage ratios, i.e.,
survey estimates prior to poststratification divided by census 
counts, for the March 2002 U.S. Current Population Survey 
and the 2003 Behavioral Risk Factors Surveillance Survey 
(BRFSS) in a set of 44 counties in the southwestern U.S. 
The survey estimates include a nonresponse adjustment for 
both CPS and BRFSS. Coverage ratios shown for the subset 
of demographic groups in the CPS range from 0.70 to 0.93 
with Black-only typically being less than for other groups. 
BRFSS is a telephone survey with low response rates in this 
set of counties, and the ratios for BRFSS are much smaller 
than for CPS. There are also substantial differences in 
coverage ratios for different groups in BRFSS. For example, 
the ratio for 35 - 44 year old Hispanic males is 0.18 but is 
0.37 for Black/Multiracial/Other males in the same age 
range. If these two groups were collapsed, incomplete 
coverage would be under-corrected for the Hispanics but 
over-corrected for the Black/Multiracial/Other group. 
Another example is the 2003 National Health Interview 
Survey (NHIS) where American Indians and Asians were 
collapsed with Whites within age groups. In the cell for ages 
25-29, for example, the coverage rates for Whites, 
American Indians, and Asians were 0.60, 0.44, and 0.31, 
respectively (Tompkins and Kim 2006). 

This paper demonstrates the weaknesses of current cell
collapsing procedures and proposes some alternatives.
Section 2 discusses the bias of some standard estimators 
when there is undercoverage. Section 3 introduces two new 
estimators that retain more of the undercoverage adjustment 
than the standard method when cells are collapsed. 
Empirical properties of the standard and alternative methods 
are investigated through simulation in section 4. We 
conclude in section 5 with a summary and some possibilities 
for future research. 


2. Some standard estimators 


Three standard estimators of the population mean are the
Hájek estimator, the poststratified estimator, and the
poststratified estimator with initial poststrata collapsed
together where necessary. Each of these is defined in detail
below. When the sampling frame covers units in the target
population at different rates, each estimator can be biased.
Kim et al. (2005) give some numerical illustrations of the
effects of collapsing pairs of cells with different coverage
rates.

To derive theory for alternative estimators of means, we
model a unit’s being covered by the sampling frame and a
cell’s being sparse or not as random events. Define three
indicator variables: $\delta_k = 1$ if unit $k$ is selected for the
sample and 0 if not; $c_k = 1$ if unit $k$ is covered by the frame
and 0 if not; $d_i = 1$ if poststratum $i$ is classified as sparse in
a particular sample and 0 if not ($i = 1, \ldots, I$). These
indicators are assumed to be mutually independent and to
have expectations $\pi_k$, $\phi_k$, and $p_i$, respectively. Consider a
stratified, two-stage probability sample design. A design
stratum is denoted by $h$; $s_h$ is the set of primary sampling
units (PSU’s) selected from design stratum $h$; $s_{hji}$ is the set
of sample units from sample PSU $j$ in stratum $h$ that are also
in poststratum $i$; $U_h$ is the population of PSU’s in stratum
$h$; $U_{hj}$ is the population of units in PSU $j$ within stratum $h$;
and $U_{hji}$ is the population of units in PSU $j$ within stratum
$h$ that are in poststratum $i$. For the analysis in this section,
the sample design does not need to be specified in more
detail. Note that some summations in sections 2 and 3 of the
form $\sum_h \sum_{j \in s_h} \sum_{k \in s_{hji}}$ for units within poststratum $i$ could
be simplified to $\sum_{k \in s_i}$ without loss of generality. We have
used the more elaborate notation to make clear how the
stages of sampling should be treated.


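2.1 Hájek estimator


First, consider the Hájek estimator of a mean, which is


$$\bar{y}_\pi = \frac{\sum_i \sum_h \sum_{j \in s_h} \sum_{k \in s_{hji}} y_k / \pi_k}{\sum_i \sum_h \sum_{j \in s_h} \sum_{k \in s_{hji}} 1 / \pi_k} \equiv \frac{\hat{T}_\pi}{\hat{N}_\pi}. \qquad (1)$$


The expectation of $\hat{T}_\pi$ with respect to sampling and the
coverage mechanism is $E_c E_\pi(\hat{T}_\pi) = \sum_i \sum_h \sum_{j \in U_h}
\sum_{k \in U_{hji}} \phi_k y_k \equiv T^c$, where the superscript $c$ denotes
“covered”. Similarly, the expectation of $\hat{N}_\pi$ is $E_c E_\pi(\hat{N}_\pi) =
\sum_i \sum_h \sum_{j \in U_h} \sum_{k \in U_{hji}} \phi_k \equiv N^c$. Expanding $\bar{y}_\pi$ around
$(T^c, N^c)$, its linear approximation is $\bar{y}_\pi \doteq T^c/N^c + (1/N^c)
[\hat{T}_\pi - (T^c/N^c)\hat{N}_\pi]$. Next, consider the bias of $\bar{y}_\pi$ as an
estimator of $\bar{Y} = \sum_{i=1}^{I} T_i / N$ with $T_i$ being the total for the
full population of units in poststratum $i$ (not just the covered
portion). After some calculation, the bias is


$$\mathrm{bias}(\bar{y}_\pi) \doteq \frac{T^c}{N^c} - \frac{T}{N} = \frac{C_{\phi y}}{\bar{\phi}}, \qquad (2)$$

where $\bar{\phi} = \sum \phi_k / N$, $C_{\phi y} = \sum (\phi_k - \bar{\phi})(y_k - \bar{Y})/N$, $\bar{Y} = T/N$,
and $\sum$ denotes the sum over $i$, $h$, $j \in U_h$, and $k \in U_{hji}$.
Consequently, $\bar{y}_\pi$ is biased if there is any correlation
between the variable measured, $y$, and the coverage
probability $\phi_k$. The bias in (2) is $O(1)$, meaning that it
remains important even in large samples.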
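The bias expression for the Hájek estimator can be verified numerically. The sketch below assumes the form discussed in this section, namely that the covered-population mean minus the full-population mean equals the covariance between coverage probability and $y$ divided by the mean coverage probability. The four-unit population is hypothetical.

```python
from typing import Sequence, Tuple


def hajek_bias(y: Sequence[float], phi: Sequence[float]) -> Tuple[float, float]:
    """Return (direct bias, covariance form) for a finite population.

    direct bias     : sum(phi * y) / sum(phi) - mean(y)
    covariance form : cov(phi, y) / mean(phi), with divisor N
    """
    n = len(y)
    covered_total = sum(p * v for p, v in zip(phi, y))
    covered_count = sum(phi)
    y_bar = sum(y) / n
    phi_bar = covered_count / n
    direct = covered_total / covered_count - y_bar
    cov = sum((p - phi_bar) * (v - y_bar) for p, v in zip(phi, y)) / n
    return direct, cov / phi_bar


# Hypothetical population in which units with larger y are covered less well.
direct, cov_form = hajek_bias([10.0, 20.0, 30.0, 40.0], [0.9, 0.8, 0.6, 0.5])
print(direct, cov_form)

# With constant coverage the bias vanishes.
flat_direct, flat_cov = hajek_bias([10.0, 20.0, 30.0, 40.0], [0.7] * 4)
```

Here both forms give a bias of about -2.5 on a true mean of 25, because coverage falls as $y$ rises; with constant coverage both are zero.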
Table 1

Coverage ratios from the Current Population Survey (CPS) and Behavioral Risk Factors Surveillance Survey
(BRFSS). White only and black only mean that only those races were reported by a respondent in the CPS.
Residual-only race group includes cases indicating a single race other than white or black, and cases indicating
two or more races. Hispanics may be of any race in the BRFSS tabulation.


March 2002 Current Population Survey

Age         White-only         Black-only         Residual-only
            Male    Female     Male    Female     Male    Female
0 - 15      0.93    0.93       0.78    0.79       0.91    0.90
16 - 19     0.90    0.88       0.76    0.81       0.93    0.75
20 - 24     0.79    0.85       0.72    0.77       0.75    0.72
25 - 34     0.83    0.89       0.70    0.76       0.76    0.80
All ages    0.90    0.92       0.78    0.83       0.85    0.84

2003 BRFSS: 44 border counties in Arizona, California, New Mexico, and Texas

Age         White Non-Hispanic   Black, Multiracial, Other   Hispanic
            Male    Female       Male    Female              Male    Female
18 - 24     0.19    0.26         0.12    0.24                0.15    0.22
25 - 34     0.20    0.31         0.10    0.16                0.19    0.39
35 - 44     0.28    0.31         0.37    0.25                0.18    0.30
All ages    0.25    0.31         0.25    0.20                0.18    0.31


Sources: Bureau of the Census (2002), Gonzalez, Town, and Kim (2005).


If the coverage probability is the same for every unit in
poststratum $i$, i.e., $\phi_k = \phi(i)$ for any $k \in U_{hji}$, then the
approximate bias reduces to $\mathrm{bias}(\bar{y}_\pi) \doteq \bar{\phi}^{-1} \sum_i W_i
(\phi(i) - \bar{\phi})(\bar{Y}_i - \bar{Y})$ where $W_i = N_i/N$ and $\bar{Y}_i =
\sum_h \sum_{j \in U_h} \sum_{k \in U_{hji}} y_k / N_i$. If there is a correlation between the
poststratum coverage probabilities and the poststratum
means, the Hájek estimator will again be biased, and the
bias could be either positive or negative. If the coverage
rates or the means are constant across poststrata, i.e.,
$\phi(i) = \bar{\phi}$ or $\bar{Y}_i = \bar{Y}$, then the Hájek estimator will be
unbiased, but poststrata are usually not formed this way.
Also, the bias exists even when the appropriate set of
poststrata, that subdivide the population into groups with
different means, is unknown to the sampler.


2.2 Poststratified mean with no cell collapsing 


The poststratified mean is defined as $\bar{y}_{PS1} =
(1/N) \sum_i (N_i / \hat{N}_{\pi i}) \hat{T}_{\pi i}$, where $\hat{T}_{\pi i}$ and $\hat{N}_{\pi i}$ are defined as in
(1) but excluding the summation over $i$. Define
$T_i^c = \sum_h \sum_{j \in U_h} \sum_{k \in U_{hji}} \phi_k y_k$ and $N_i^c = \sum_h \sum_{j \in U_h} \sum_{k \in U_{hji}} \phi_k$. These
are the expected (with respect to the coverage mechanism)
total and count of covered units in poststratum $i$. Expanding
$\bar{y}_{PS1}$ around $(T_i^c, N_i^c)$, $i = 1, \ldots, I$, its linear approximation
is


$$\bar{y}_{PS1} \doteq \frac{1}{N} \sum_i \frac{N_i}{N_i^c} \left[ T_i^c + \left( \hat{T}_{\pi i} - \frac{T_i^c}{N_i^c} \hat{N}_{\pi i} \right) \right].$$

The bias of the poststratified estimator is then the first term
of this expression minus $\sum_{i=1}^{I} T_i / N$, and after some
manipulation, can be written as


$$\mathrm{bias}(\bar{y}_{PS1}) \doteq \sum_i W_i \frac{C_{\phi y i}}{\bar{\phi}_i}, \qquad (3)$$


where $\bar{\phi}_i = \sum \phi_k / N_i$, $C_{\phi y i} = \sum (\phi_k - \bar{\phi}_i)(y_k - \bar{Y}_i)/N_i$, and $\sum$
denotes the sum over $h$, $j \in U_h$, and $k \in U_{hji}$. Thus, $\bar{y}_{PS1}$
is biased if there is any correlation between the $y$ variable
measured and the coverage probability $\phi_k$ in any of the
poststrata. If the coverage rate is constant at $\phi_k = \phi(i)$
within poststratum $i$, then the poststratified estimator is
approximately unbiased. From (3) it is apparent that
poststrata should be formed so that either coverage rates or 
the y’s are homogeneous within each poststratum. This is 
similar to the recommendations of Eltinge and Yansaneh 
(1997), Kalton and Maligalig (1991), and Little and 
Vartivarian (2005) for the formation of nonresponse 
adjustment cells. In large surveys, the initial set of candidate 
poststrata is often more extensive than the sample can 
support. With few exceptions, some of the initial poststrata 
are collapsed to control weight adjustments. If no collapsing 
occurs, this is usually because small categories are pre- 
collapsed based on prior experience in the same or a similar 
survey. In that sense, PS1 does not really exist in practice. 
More common is the collapsing approach, PS2, described 
below. 


2.3 Poststratified mean with collapsing 


Turning to the poststratified estimator with collapsing, 
the sparse cells are identified and combined with other cells 
considered to be their nearest neighbors. This could result in 
more than one sparse cell being collapsed with a given 
nonsparse cell. Neighbors can be defined in various ways, 
e.g., cells with similar estimated coverage rates, N,, /N,, 
cells that are adjacent in some substantive sense like nearby 
income classes, or cells that have similar means on some 
important survey variables. The general algorithm for 
collapsing, given an initial set of cells, is:

(1) Compute the collapsing criteria for each cell, e.g.,
the IAF's, $N_i/\hat N_i$, and the cell sample sizes;


(2) Identify the sparse cells, i.e., those whose criteria 
fall outside the bounds for collapsing; 


(3) Determine the nearest, nonsparse neighbor of each 
sparse cell and combine the sparse cell with its 
neighbor. 
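
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `collapse_cells`, the default bounds `f_max=2.0` and `n_min=20`, and the choice of "nearest neighbor" as the nonsparse cell with the closest estimated coverage rate $\hat N_i/N_i$ are all hypothetical choices made for the example, not the rules used in any particular survey.

```python
def collapse_cells(N, N_hat, n, f_max=2.0, n_min=20):
    """Map each initial cell to the cell it is grouped with.

    N     : control totals N_i for the initial cells
    N_hat : estimated totals N_hat_i from the sample weights
    n     : cell sample sizes
    f_max, n_min : hypothetical bounds on the IAF and the cell sample size
    """
    # Step (1): collapsing criteria, here the IAF N_i / N_hat_i and n_i.
    iaf = [Ni / Nh for Ni, Nh in zip(N, N_hat)]
    # Step (2): a cell is sparse if either criterion is outside its bound.
    sparse = [f > f_max or ni < n_min for f, ni in zip(iaf, n)]
    nonsparse = [i for i, s in enumerate(sparse) if not s]
    if not nonsparse:
        raise ValueError("no nonsparse cell available to collapse into")
    # Step (3): nearest nonsparse neighbor, defined here (an assumption)
    # as the cell with the closest estimated coverage rate N_hat_i / N_i.
    cover = [Nh / Ni for Ni, Nh in zip(N, N_hat)]
    group = list(range(len(N)))
    for i, s in enumerate(sparse):
        if s:
            group[i] = min(nonsparse, key=lambda j: abs(cover[j] - cover[i]))
    return group


# Four illustrative cells; cell 1 has IAF 800/350 > 2 and is therefore sparse.
groups = collapse_cells(N=[1000, 800, 600, 400],
                        N_hat=[900, 350, 500, 380],
                        n=[60, 25, 40, 30])
```

In this example cell 1 is combined with cell 2, whose estimated coverage rate (about 0.83) is the closest among the nonsparse cells. More than one sparse cell can map to the same nonsparse cell, as noted above.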


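The poststratified mean with collapsing is then
$\hat{\bar y}_{PS2} = (1/N)\sum_{g} N_g \hat T_g/\hat N_g$, where $\hat T_g = \sum_{i \in A_g}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} y_k/\pi_k$
and $\hat N_g$ is defined similarly. Define $T_g^{c} = \sum_{i \in A_g} T_i^{c}$
and $N_g^{c} = \sum_{i \in A_g} N_i^{c}$, with $A_g$ being the set of poststrata in
collapsed group g. Expanding $\hat{\bar y}_{PS2}$ around $(T_g^{c}, N_g^{c})$, for
each collapsed group g, gives

$$\hat{\bar y}_{PS2} \approx \frac{1}{N}\sum_{g}\left\{ N_g\frac{T_g^{c}}{N_g^{c}} + \frac{N_g}{N_g^{c}}\left(\hat T_g - \frac{T_g^{c}}{N_g^{c}}\hat N_g\right)\right\}.$$

It follows that

$$\mathrm{bias}(\hat{\bar y}_{PS2}) \approx \sum_{g} W_g \frac{C_{\phi y,g}}{\bar\phi_g} \tag{4}$$

with

$$W_g = N_g/N, \qquad \bar\phi_g = \sum \phi_k/N_g, \qquad C_{\phi y,g} = \sum (\phi_k - \bar\phi_g)(y_k - \bar Y_g)/N_g,$$

and the summations in $\bar\phi_g$ and $C_{\phi y,g}$ are over $i \in A_g$, $h$,
$j \in U_h$, and $k \in U_{hj(i)}$. If $\phi_k$ is constant within collapsed
group g, this estimator is unbiased, but if $\phi_k = \phi(i)$, i.e., the
coverage rate is constant within poststratum i but can differ
across the poststrata, then the bias becomes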


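$$\mathrm{bias}(\hat{\bar y}_{PS2}) \approx \sum_{g} W_g \frac{C^{*}_{\phi y,g}}{\bar\phi_g} \tag{5}$$

with $\bar\phi_g = \sum W_{gi}\phi(i)$, $C^{*}_{\phi y,g} = \sum W_{gi}(\phi(i) - \bar\phi_g)(\bar Y_i - \bar Y_g)$, $W_{gi} = N_i/N_g$,
and the summations are over $i \in A_g$.

Thus, in the case where $\hat{\bar y}_{PS1}$ will be unbiased, $\hat{\bar y}_{PS2}$ will
be biased if poststrata are collapsed together that have
different coverage rates and different population means.
Since $\bar\phi_g$ and $C^{*}_{\phi y,g}$ are both O(1), the bias does not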
decrease as the sample increases; thus, the bias-squared will 
eventually be the dominant part of the mean square error. If 
cells are collapsed, the cells in each group should have the 
same coverage rates, the same means, or both to avoid bias. 
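
A small numerical check may make expression (5) concrete. The sketch below treats a single collapsed group with hypothetical cell sizes, cell means, and coverage rates (all values are invented for illustration). It compares the bias from (5) with the direct calculation $T_g^{c}/N_g^{c} - \bar Y_g$, taking $T_i^{c} = N_i\phi(i)\bar Y_i$ and $N_i^{c} = N_i\phi(i)$ under the common-coverage assumption.

```python
def ps2_bias_formula(N, Ybar, phi):
    """Approximate bias of the collapsed-cell mean from expression (5),
    for one group containing cells with sizes N, means Ybar, coverage phi."""
    N_g = sum(N)
    W = [Ni / N_g for Ni in N]                       # W_gi = N_i / N_g
    phi_bar = sum(Wi * p for Wi, p in zip(W, phi))   # mean coverage in group
    Y_g = sum(Wi * y for Wi, y in zip(W, Ybar))      # group mean
    C_star = sum(Wi * (p - phi_bar) * (y - Y_g)
                 for Wi, p, y in zip(W, phi, Ybar))
    return C_star / phi_bar


def ps2_bias_direct(N, Ybar, phi):
    """Mean of the covered units minus the mean of all units in the group."""
    T_c = sum(Ni * p * y for Ni, p, y in zip(N, phi, Ybar))
    N_c = sum(Ni * p for Ni, p in zip(N, phi))
    Y_g = sum(Ni * y for Ni, y in zip(N, Ybar)) / sum(N)
    return T_c / N_c - Y_g


# Two cells with different coverage AND different means: bias appears.
b_formula = ps2_bias_formula([500, 500], [0.10, 0.30], [0.9, 0.5])
b_direct = ps2_bias_direct([500, 500], [0.10, 0.30], [0.9, 0.5])
# Same coverage in both cells: no bias even though the means differ.
b_equal_cov = ps2_bias_formula([500, 500], [0.10, 0.30], [0.7, 0.7])
```

With these inputs both calculations give a bias of about -0.029 on a group mean of 0.20, and the bias vanishes when the two cells share a coverage rate. The result does not depend on any sample size, which is the sense in which the bias is O(1).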


3. Weight restricted estimators 


We examine two alternative methods of weight compu- 
tation when collapsing of poststrata is used, extending work 
of Kim (2004). The alternatives are designed to be 
compromises between (a) use of all poststrata and the 
potential for large weight adjustments and (b) collapsing of 
poststrata yielding less variable weights but potentially
biased estimates. We refer to these as weight restriction
(WR) methods. The two alternatives presented in this 
section use cell collapsing but retain a larger share of the 
weight adjustment for individual cells than does the 
standard collapsing method. 

The first alternative is denoted PS.WR1 and consists of
the following algorithm. Denote the maximum allowable
weight adjustment by $f_{\max}$, with $f_{\max} > 1$.


(1) Execute steps (1) - (3) of the algorithm in section
2.3 for PS2.


(2) Censor any IAF greater than $f_{\max}$ to $f_{\max}$ and
adjust each weight in the corresponding initial cell
to $\tilde w_k = w_k f_{\max}$, with $w_k = 1/\pi_k$. For units in cells
with IAF $\le f_{\max}$, set $\tilde w_k = w_k$.


(3) Compute a collapsing adjustment factor (CAF) for
a collapsed group g as

$$f_g = \frac{N_g}{\sum_{i \in A_g}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} \tilde w_k}.$$


(4) The final adjusted weight is then $\tilde w_k f_g$ for unit k
in group g.


This method will reduce the largest values of the final
weight adjustment below the without-collapsing adjustments,
$N_i/\hat N_i$, though there may be one or more groups
that have CAF's greater than the $f_{\max}$ cutoff. The control
total for group g, $N_g$, is met in the sense that
$\sum_{i \in A_g}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} \tilde w_k f_g = N_g$, but the control totals for the
individual cells in $A_g$ are not.
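
The PS.WR1 weight computation can be sketched as follows. The function name `ps_wr1_weights`, its argument layout, and the toy data are assumptions made for illustration; the grouping of cells (steps (1) - (3) of the collapsing algorithm) is taken as given through the hypothetical `group_of` mapping.

```python
from collections import defaultdict


def ps_wr1_weights(w, cell, N, group_of, f_max=2.0):
    """Final PS.WR1 weights.

    w        : base weights w_k = 1/pi_k, one per sample unit
    cell     : initial poststratum label of each unit
    N        : dict of control totals N_i by initial poststratum
    group_of : dict giving the collapsed group of each initial poststratum
    """
    # Estimated cell counts from the base weights.
    N_hat = defaultdict(float)
    for wk, i in zip(w, cell):
        N_hat[i] += wk
    # Step (2): cells whose IAF N_i / N_hat_i exceeds f_max get factor f_max.
    w_tilde = [wk * f_max if N[i] / N_hat[i] > f_max else wk
               for wk, i in zip(w, cell)]
    # Step (3): collapsing adjustment factor f_g = N_g / sum of w_tilde in g.
    N_g = defaultdict(float)
    for i, Ni in N.items():
        N_g[group_of[i]] += Ni
    sum_tilde = defaultdict(float)
    for wt, i in zip(w_tilde, cell):
        sum_tilde[group_of[i]] += wt
    caf = {g: N_g[g] / sum_tilde[g] for g in N_g}
    # Step (4): final adjusted weight.
    return [wt * caf[group_of[i]] for wt, i in zip(w_tilde, cell)]


# Toy data: cell "a" has IAF 100/30 (above 2), cell "b" has IAF 300/200.
w = [10.0] * 3 + [50.0] * 4
cell = ["a"] * 3 + ["b"] * 4
N = {"a": 100.0, "b": 300.0}
group_of = {"a": "g", "b": "g"}
final_wr1 = ps_wr1_weights(w, cell, N, group_of, f_max=2.0)
```

In this toy case the final weights sum to the group control total of 400, while the cell "a" weights sum to about 92 rather than 100. The final adjustment in cell "a" is about 3.08, which is below the uncollapsed adjustment of 3.33 but above $f_{\max} = 2$, matching the behavior described above.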

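To analyze the properties of PS.WR1, define $A_{g,sp}$ and
$A_{g,nsp}$ to be the sets of sparse and nonsparse poststrata in
collapsed group g. PS.WR1 can be expressed as

$$\hat{\bar y}_{PS.WR1} = \frac{1}{N}\sum_{g} N_g \frac{\hat T_{g.WR1}}{\hat N_{g.WR1}}$$

where

$$\hat T_{g.WR1} = \sum_{i \in A_{g,nsp}}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} w_k y_k + f_{\max}\sum_{i \in A_{g,sp}}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} w_k y_k$$

and $\hat N_{g.WR1}$ has a similar definition with $y_k$ set to 1. The
expectation of $\hat T_{g.WR1}$ over the coverage, sparseness, and
sampling mechanisms is $E_{c}E_{sp}E_{\pi}(\hat T_{g.WR1}) = T_g^{c} + (f_{\max} - 1)\tilde T_g^{c}$,
where $\tilde T_g^{c} = \sum_{i \in A_g} p_i T_i^{c}$. Likewise, $E_{c}E_{sp}E_{\pi}(\hat N_{g.WR1}) =
N_g^{c} + (f_{\max} - 1)\tilde N_g^{c}$ with $\tilde N_g^{c} = \sum_{i \in A_g} p_i N_i^{c}$. $\hat{\bar y}_{PS.WR1}$ can be
expanded around the expectations of $(\hat T_{g.WR1}, \hat N_{g.WR1})$. After
some manipulation, the approximate bias of $\hat{\bar y}_{PS.WR1}$
becomes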


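$$\mathrm{bias}(\hat{\bar y}_{PS.WR1}) \approx \sum_{g} W_g \frac{C_{(a\phi)y,g}}{\overline{(a\phi)}_g} \tag{6}$$

where

$$\overline{(a\phi)}_g = \sum a_i\phi_k/N_g, \qquad a_i = 1 + (f_{\max} - 1)p_i,$$

$$C_{(a\phi)y,g} = \sum \left(a_i\phi_k - \overline{(a\phi)}_g\right)(y_k - \bar Y_g)/N_g,$$

and the summations are over $i \in A_g$, $h$, $j \in U_h$, and
$k \in U_{hj(i)}$. In the case of a common coverage probability in
poststratum i, i.e., $\phi_k = \phi(i)$, we have

$$\overline{(a\phi)}_g = \bar\phi_g + (f_{\max} - 1)\overline{(p\phi)}_g$$

and

$$a_i\phi(i) - \overline{(a\phi)}_g = \left(\phi(i) - \bar\phi_g\right) + (f_{\max} - 1)\left(p_i\phi(i) - \overline{(p\phi)}_g\right)$$

where $\overline{(p\phi)}_g = \sum_{i \in A_g} W_{gi}\,p_i\phi(i)$. From this it follows that

$$C_{(a\phi)y,g} = C^{*}_{\phi y,g} + (f_{\max} - 1)\,C_{(p\phi)y,g}$$

with

$$C_{(p\phi)y,g} = \sum_{i \in A_g} W_{gi}\left(p_i\phi(i) - \overline{(p\phi)}_g\right)(\bar Y_i - \bar Y_g).$$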


If the cell means $\bar Y_i$ are all equal within a collapsed group,
then $C^{*}_{\phi y,g} = C_{(p\phi)y,g} = 0$ and $\hat{\bar y}_{PS.WR1}$ will be approximately
unbiased. In the special case in which coverage is constant
in a group, i.e., $\phi(i) = \bar\phi_g$, then $C^{*}_{\phi y,g} = 0$ and $C_{(p\phi)y,g} =
\bar\phi_g\sum_{i \in A_g} W_{gi}(p_i - \bar p_g)(\bar Y_i - \bar Y_g)$ with $\bar p_g = \sum_{i \in A_g} W_{gi}\,p_i$. Thus,
if $p_i$ and $\phi(i)$ are constant within $A_g$, then $\hat{\bar y}_{PS.WR1}$ will be
nearly unbiased even if the cell means $\bar Y_i$, $i \in A_g$, differ.
This condition is almost sure to be false as long as one
poststratum in a group has a probability of being sparse that
is substantially different from the others.

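In the case of a common coverage probability in
poststratum i, $\phi(i)$, we can also compare the biases of the
collapsed cell estimator, $\hat{\bar y}_{PS2}$, with that of $\hat{\bar y}_{PS.WR1}$. Using
results in the previous paragraph, the bias in (6) can be
expressed as

$$\mathrm{bias}(\hat{\bar y}_{PS.WR1}) \approx \sum_{g} W_g\,\frac{C^{*}_{\phi y,g}/\bar\phi_g + (f_{\max} - 1)\,C_{(p\phi)y,g}/\bar\phi_g}{1 + (f_{\max} - 1)\,\overline{(p\phi)}_g/\bar\phi_g}.$$

Since $1 + (f_{\max} - 1)\overline{(p\phi)}_g/\bar\phi_g \ge 1$, we can use (5) to obtain

$$\begin{aligned}
\left|\mathrm{bias}(\hat{\bar y}_{PS.WR1})\right|
&\le \left|\sum_{g} W_g\left[C^{*}_{\phi y,g}/\bar\phi_g + (f_{\max} - 1)\,C_{(p\phi)y,g}/\bar\phi_g\right]\right| \\
&= \left|\mathrm{bias}(\hat{\bar y}_{PS2}) + (f_{\max} - 1)\sum_{g} W_g\,C_{(p\phi)y,g}/\bar\phi_g\right|.
\end{aligned} \tag{7}$$

If $p_i\phi(i)$ and $\bar Y_i$ are uncorrelated, the absolute bias of
$\hat{\bar y}_{PS.WR1}$ is less than or equal to that of $\hat{\bar y}_{PS2}$ because
$1 + (f_{\max} - 1)\overline{(p\phi)}_g/\bar\phi_g \ge 1$. When $p_i\phi(i)$ and $\bar Y_i$ are
correlated, there are two cases to consider: (i)
$\mathrm{bias}(\hat{\bar y}_{PS2}) \ge 0$ and (ii) $\mathrm{bias}(\hat{\bar y}_{PS2}) < 0$. In the former, the
last line of (7) will be less than or equal to the absolute bias
of $\hat{\bar y}_{PS2}$ if

$$\frac{-2\left|\mathrm{bias}(\hat{\bar y}_{PS2})\right|}{f_{\max} - 1} \le \sum_{g} W_g\,C_{(p\phi)y,g}/\bar\phi_g \le 0.$$

In case (ii), the requirement is

$$0 \le \sum_{g} W_g\,C_{(p\phi)y,g}/\bar\phi_g \le \frac{-2\,\mathrm{bias}(\hat{\bar y}_{PS2})}{f_{\max} - 1}.$$

If the covariance between the probability of being sparse
and covered, $p_i\phi(i)$, and the cell means, $\bar Y_i$, is small in all
groups and the opposite sign of $\mathrm{bias}(\hat{\bar y}_{PS2})$, then $\hat{\bar y}_{PS.WR1}$
will be less biased than $\hat{\bar y}_{PS2}$.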
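
The comparison of the two biases can be illustrated numerically. The sketch below evaluates the approximate PS2 bias from (5) and the approximate PS.WR1 bias from (6) for a single collapsed group under a common coverage rate per cell. All inputs, including the probabilities of being sparse `p`, are hypothetical values chosen for the illustration.

```python
def approx_biases(W, Ybar, phi, p, f_max=2.0):
    """Approximate biases of PS2 and PS.WR1 for one collapsed group.

    W    : within-group cell shares W_gi = N_i / N_g (summing to 1)
    Ybar : cell means
    phi  : coverage rate of each cell, phi(i)
    p    : probability that each cell is classified as sparse (hypothetical)
    """
    Y_g = sum(Wi * y for Wi, y in zip(W, Ybar))
    phi_bar = sum(Wi * f for Wi, f in zip(W, phi))
    # PS2: covariance of phi(i) with the cell means, divided by mean coverage.
    C_star = sum(Wi * (f - phi_bar) * (y - Y_g)
                 for Wi, f, y in zip(W, phi, Ybar))
    bias_ps2 = C_star / phi_bar
    # PS.WR1: same calculation with phi(i) replaced by a_i * phi(i),
    # where a_i = 1 + (f_max - 1) * p_i.
    a_phi = [(1.0 + (f_max - 1.0) * pi) * f for pi, f in zip(p, phi)]
    a_phi_bar = sum(Wi * af for Wi, af in zip(W, a_phi))
    C_a = sum(Wi * (af - a_phi_bar) * (y - Y_g)
              for Wi, af, y in zip(W, a_phi, Ybar))
    bias_wr1 = C_a / a_phi_bar
    return bias_ps2, bias_wr1


# The poorly covered cell (phi = 0.5) is sparse half the time; the other never.
bias_ps2, bias_wr1 = approx_biases(W=[0.5, 0.5], Ybar=[0.10, 0.30],
                                   phi=[0.9, 0.5], p=[0.0, 0.5], f_max=2.0)
```

Here the PS2 bias is about -0.029 and the PS.WR1 bias is about -0.009. The reduction arises because the factor $a_i$ raises the effective coverage of the cell that tends to be sparse (from 0.5 to 0.75), which brings it closer to the coverage of the other cell.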

The second alternative is denoted PS.WR2 and is
intended to exercise more control over the size of the final
weight adjustment than does PS.WR1. In PS.WR1 the final
adjustment can be larger than $f_{\max}$. PS.WR2 seeks to limit
the final adjustment to $f_{\max} = 2$ or some other maximum
set in advance. The general idea is to first determine which
cells should be collapsed together, as was done for PS.WR1.
Then weights in the sparse cells are multiplied by $f_{\max}$. The
weights in the nonsparse cells in a collapsed group are then
adjusted by a constant factor to bring the estimated
population count in the group to the control count. The
detailed algorithm for computing weights for PS.WR2 is the
following:


(1) Execute steps (1) - (3) of the algorithm in section
2.3 for PS2.


(2) In a group containing at least one nonsparse cell,
compute the control total in group g as
$N_g = \sum_{i \in A_g} N_i$ and the adjusted weight for all units
k in $A_{g,sp}$ as $\tilde w_k = w_k f_{\max}$.


(3) Compute the adjusted weight for all units k in
$A_{g,nsp}$ as $\tilde w_k = w_k\,(N_g - f_{\max}\hat N_{g,sp})/\hat N_{g,nsp}$, where
$\hat N_{g,sp} = \sum_{i \in A_{g,sp}}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} w_k$ and
$\hat N_{g,nsp} = \sum_{i \in A_{g,nsp}}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} w_k$.


(4) The final adjusted weight is then $\tilde w_k$ for unit k in
group g.
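
A minimal sketch of the PS.WR2 weight computation follows. As with the earlier illustration, the function name `ps_wr2_weights`, the argument layout, and the toy data are assumptions; the set of sparse cells and the grouping are supplied by the caller.

```python
from collections import defaultdict


def ps_wr2_weights(w, cell, N, group_of, sparse, f_max=2.0):
    """Final PS.WR2 weights.

    w        : base weights w_k = 1/pi_k, one per sample unit
    cell     : initial poststratum label of each unit
    N        : dict of control totals N_i by initial poststratum
    group_of : dict giving the collapsed group of each initial poststratum
    sparse   : set of initial poststrata classified as sparse
    """
    # Step (2): group control totals N_g.
    N_g = defaultdict(float)
    for i, Ni in N.items():
        N_g[group_of[i]] += Ni
    # Base-weight totals in the sparse and nonsparse cells of each group.
    N_hat_sp = defaultdict(float)
    N_hat_nsp = defaultdict(float)
    for wk, i in zip(w, cell):
        target = N_hat_sp if i in sparse else N_hat_nsp
        target[group_of[i]] += wk
    final = []
    for wk, i in zip(w, cell):
        g = group_of[i]
        if i in sparse:
            # Sparse cells: weight multiplied by f_max.
            final.append(wk * f_max)
        else:
            # Step (3): nonsparse cells absorb the rest of the control total.
            final.append(wk * (N_g[g] - f_max * N_hat_sp[g]) / N_hat_nsp[g])
    return final


w = [10.0] * 3 + [50.0] * 4
cell = ["a"] * 3 + ["b"] * 4
N = {"a": 100.0, "b": 300.0}
group_of = {"a": "g", "b": "g"}
final_wr2 = ps_wr2_weights(w, cell, N, group_of, sparse={"a"}, f_max=2.0)
```

With these toy inputs the sparse-cell weights are exactly doubled, the nonsparse weights are multiplied by (400 - 60)/200 = 1.7, and the final weights again sum to the group control total of 400. In a group with no sparse cell the factor reduces to the ordinary poststratification adjustment $N_g/\hat N_g$.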
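This second weight restricted estimator can be written as
$\hat{\bar y}_{PS.WR2} = (1/N)\sum_{g} \hat T_{g.WR2}$, where

$$\hat T_{g.WR2} = f_{\max}\sum_{i \in A_{g,sp}}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} w_k y_k + \frac{N_g - f_{\max}\hat N_{g,sp}}{\hat N_{g,nsp}}\sum_{i \in A_{g,nsp}}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} w_k y_k$$

and $\hat N_{g,sp}$ and $\hat N_{g,nsp}$ are the sums of the weights $w_k$ over the sample
units in the sparse and nonsparse cells of group g. The expectation of
$\hat T_{g.WR2}$ over the coverage, sparseness, and sampling
mechanisms is

$$E_{c}E_{sp}E_{\pi}(\hat T_{g.WR2}) \approx f_{\max}\tilde T_g^{c} + \left(N_g - f_{\max}\tilde N_g^{c}\right)\frac{T_g^{c} - \tilde T_g^{c}}{N_g^{c} - \tilde N_g^{c}}$$

where $\tilde N_g^{c} = \sum_{i \in A_g} p_i N_i^{c}$. After some calculation, the
approximate bias of $\hat{\bar y}_{PS.WR2}$ can be written as

$$\mathrm{bias}(\hat{\bar y}_{PS.WR2}) \approx \frac{1}{N}\sum_{g}\left[ f_{\max}\tilde N_g^{c}\left(\mu_{g,sp} - \mu_{g,nsp}\right) + \frac{1}{\overline{(q\phi)}_g}\sum\left(q_i\phi_k - \overline{(q\phi)}_g\right)(y_k - \bar Y_g)\right]$$

where $\mu_{g,sp} = \tilde T_g^{c}/\tilde N_g^{c}$ and $\mu_{g,nsp} = \sum_{i \in A_g} q_i T_i^{c}/\sum_{i \in A_g} q_i N_i^{c}$.
Next, note that in the case of a common coverage
probability in cell i, $\phi_k = \phi(i)$,

$$\begin{aligned}
\sum\left(q_i\phi_k - \overline{(q\phi)}_g\right)(y_k - \bar Y_g)
&= \sum_{i \in A_g} N_i\left(q_i\phi(i) - \overline{(q\phi)}_g\right)(\bar Y_i - \bar Y_g) \\
&= N_g\left(C^{*}_{\phi y,g} - C_{(p\phi)y,g}\right)
\end{aligned}$$

with $q_i = 1 - p_i$, $\overline{(q\phi)}_g = \sum q_i\phi_k/N_g$, and $C^{*}_{\phi y,g}$ and $C_{(p\phi)y,g}$ were
defined previously. Next, use the fact that $\sum_{i \in A_g} q_i N_i^{c} =
N_g^{c} - \tilde N_g^{c}$ to define $P_g^{c} = N_g^{c}/N_g$, the proportion of units
covered in group g, and $P_{g,sp}^{c} = \tilde N_g^{c}/N_g$, the expected
proportion covered in sparse cells in group g. Then, the bias
can also be written as

$$\mathrm{bias}(\hat{\bar y}_{PS.WR2}) \approx f_{\max}\sum_{g} W_g P_{g,sp}^{c}\left(\mu_{g,sp} - \mu_{g,nsp}\right) + \sum_{g} W_g\,\frac{C^{*}_{\phi y,g} - C_{(p\phi)y,g}}{P_g^{c} - P_{g,sp}^{c}}. \tag{8}$$

Judging from (8), $\hat{\bar y}_{PS.WR2}$ will be approximately
unbiased if the mean per unit for the units covered by the
frame in each collapsed cell is the same in the sparse cells,
$\mu_{g,sp}$, as in the nonsparse cells, i.e., $\mu_{g,sp} = \mu_{g,nsp}$, and the
covariances, $C^{*}_{\phi y,g}$ and $C_{(p\phi)y,g}$, are both 0. The latter is
accomplished by combining cells with the same means, $\bar Y_i$.
Combining cells with equal coverage rates does not result in
$\hat{\bar y}_{PS.WR2}$ being unbiased. This is more restrictive than for
$\hat{\bar y}_{PS.WR1}$, which is unbiased if either the coverage rates or the
means are the same in all cells in a collapsed group.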


4. An empirical investigation


To test some of the ideas presented earlier, we conducted 
a simulation study of the bias properties of alternative 
methods of poststratification. We also examined the 
performance of one variance estimator that is often used in 
practice. 


4.1 Study population 


The population used in the simulation was extracted from 
the 2003 National Health Interview Survey (NHIS) person
public-use file. A subset of the NHIS was created with
21,664 persons. These were divided into 25 strata with each 
having six PSUs. The strata and PSU’s are based on those in 
the NHIS public use file, but sets of three strata were 
collapsed together to create new design strata for the study 
population. We used four binary variables (0-1 char- 
acteristics) for the simulation, each of which is based on a 
person’s self-report: 


Health insurance coverage - whether a person was 
covered by any type of health insurance; 


Physical, mental, or emotional limitation - whether a 
person was limited in any of these ways; 


Medical care delayed - whether a person delayed 
medical care or not because of cost in last 12 
months; 


Overnight hospital stay - whether a person stayed 
overnight in a hospital in last 12 months. 


Table 2 shows the percentages of persons with these four 
characteristics in cells formed by age and sex. These 16 
(age x sex) cells are the initial set of poststrata used in 
estimation. The percentages can vary substantially among 
the cells, depending on the characteristic. For example,
18-24 year olds are much more likely to have no health
insurance; children under age 5 and the elderly age 65 and 
over are much more likely to have had a hospital stay. 
Collapsing cells together that have different means, or 
proportions in this case, has the potential to introduce bias, 
as noted earlier. 

We also created one artificial binary variable that had a 
common mean of 0.20 regardless of the unit’s poststratum 
membership. In that case all estimators, including the Hajek 
estimator, will be unbiased regardless of coverage rates. 
Also, the conventional thinking that collapsing of cells may 
reduce variances by smoothing out extreme weight 
adjustments may hold for this variable. 


4.2 Sample design 


Two sample PSU’s were selected in each stratum with 
probability proportional to size (PPS) with the size being the 
count of persons in each PSU. Sampling of PSU’s was done 
with-replacement to simplify variance estimation. If 
without-replacement sampling had been used, then a more 
elaborate method of selection and variance estimation 
would have been needed (see, e.g., Sarndal, Swensson, and 
Wretman 1992, chapter 3). In each sample PSU, 20 persons 
were selected by simple random sampling without 
replacement for a total of 1,000 persons in each sample. For 
each combination of parameters discussed below, 2,000 
samples were selected. 

Sixteen initial poststrata were used which were the cross 
of the eight age groups, shown in Table 2, with gender. In
each sample, we computed the estimators of population
proportions described earlier in sections 2-3: the Hajek
estimator, the poststratified estimator $\hat{\bar y}_{PS1}$ that uses all
16 poststrata, the poststratified estimator with collapsing of
cells, $\hat{\bar y}_{PS2}$, and the two weight-restricted estimators,
$\hat{\bar y}_{PS.WR1}$ and $\hat{\bar y}_{PS.WR2}$. The simulation code was written in
the R language (R Development Core Team 2005) with
extensive use of the R survey package (Lumley 2004,
2005).


4.3 Coverage mechanisms 


Five sets of coverage mechanisms, shown in Table 3, 
were employed to filter the population before the PSU’s 
were sampled. The coverage ratios varied by poststratum 
and were different for each of the five characteristics for 
which proportions were estimated. The coverage ratios 
specific to each of the five characteristics are named C1
through C5 in Table 3. These coverage ratios were
artificially created based on the population means for each 
age and sex group. Poorer coverage was assigned to groups 
with larger percentages with a characteristic for health 
insurance coverage and limitations; the opposite was true 
for delayed medical care and hospital stays. In C5 the 
coverage ratios are quite variable and are intended to lead to 
coverage adjustments that vary substantially among the 
initial set of 16 poststrata. Although the rates in Table 3 are 
low, they are comparable to or higher than those for BRFSS 
in Table 1. In applying these rates, we randomly selected a 
subset of the population to be in the sample frame for each
sample that was selected. For example, if the coverage ratio
in the poststratum of males younger than 5 years old is 0.9, 
then 90% of the population in that poststratum was 
randomly selected to stay in the sampling frame while the 
rest had a zero probability of being sampled. 
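
The filtering step can be sketched as follows. This is an illustrative reading of the description above, not the simulation code itself (which was written in R): the function name `build_frame`, the data layout, and the rounding of the number of retained units are assumptions.

```python
import random


def build_frame(population, coverage, seed=0):
    """Randomly retain a fixed share of each poststratum in the frame.

    population : list of (unit_id, poststratum) pairs
    coverage   : dict of coverage ratios by poststratum
    seed       : seed for the random number generator (hypothetical default)
    """
    rng = random.Random(seed)
    by_cell = {}
    for unit_id, cell in population:
        by_cell.setdefault(cell, []).append(unit_id)
    frame = []
    for cell, units in by_cell.items():
        # Assumption: the retained count is the coverage ratio times the
        # poststratum size, rounded to the nearest integer.
        n_keep = round(coverage[cell] * len(units))
        frame.extend(rng.sample(units, n_keep))
    return frame


population = ([(i, "male <5") for i in range(100)]
              + [(100 + i, "female <5") for i in range(50)])
frame = build_frame(population, {"male <5": 0.9, "female <5": 0.8})
```

Units left out of `frame` have zero probability of selection, so a new frame would be drawn in this way before each sample is selected.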


4.4 Collapsing rules 


We set up situations where the conditions for
unbiasedness in sections 2 and 3 can be violated when cells
are collapsed in the simulations. Each of the estimators,
$\hat{\bar y}_{PS2}$, $\hat{\bar y}_{PS.WR1}$, and $\hat{\bar y}_{PS.WR2}$, involves cell collapses. If the IAF
(poststratification factor) in an initial poststratum, $N_i/\hat N_i$,
exceeds the maximum allowable adjustment, $f_{\max}$, or the
cell sample size is less than a minimum, $n_{i,\min}$, we call this
poststratum a “sparse” cell and collapse it with a
neighboring cell. We used two methods of determining
neighbors, designated here as “adjacency” and
“close-mean”.

In adjacency collapsing, the neighbors of a specific cell 
are defined as the cells either horizontally or vertically 
adjacent to it in the age x sex table. For example, in the
following abbreviated table, the neighbors of cell 3 are the
bracketed cells 2, 4, and 7.


     1    [2]    3    [4]
     5     6    [7]    8


Table 2 Percentages of persons with four health-related characteristics in groups formed by age and sex 
Percentage of persons with characteristic 
Population Counts Not covered by Physical, mental, Delayed medical care Hospital stay in last 
health insurance emotional limitations in last 12 months 12 months 

Age Male Female Total Male Female Total Male Female Total Male Female Total Male Female Total 
= 843 TIS. ~ "15638 10 9 9 4 3 3 3 4 3 Mes 16 
5-17 DIT DOSZWAB53 13 14 13 10 6 8 4 4 4 2 1 2 
18 - 24 998 1,031 2,029 3) 3] 34 4 4 4 8 11 9 yi cl4 8 
25-44 5 2,971. 3207... 6,178 28 23 Des 4 ol ri 9 10 9 3.2010 6 
45-64 2,421 2,597 5,018 14 14 14 16 19 18 7 11 9 8 10 9 
65 - 69 305 384 689 2 1 2 24 29 27 5 8 6 15 14 14 
70 - 74 275 344 619 1 1 1 34 32 33 2 5 4 ts" 215 i 
it 423 717 ~—«1,140 1 1 1 4] 48 45 2 2 2 2ay S22 22, 
Total 10,507 11,157 21,664 18h tu 26 17 Ys 13 13 6 8 7 Taek LO) 8 


Table 3 Coverage ratios used in the simulations

           C1: Not covered   C2: Physical,       C3: Delayed        C4: Hospital       C5: Common
           by health         mental, emotional   medical care in    stay in the        mean Y
           insurance         limitations         last 12 months     last 12 months
Age        Male   Female     Male   Female       Male   Female      Male   Female      Male   Female
< 5        0.9    0.9        0.9    0.9          0.5    0.5         0.8    0.8         0.9    0.8
5 - 17     0.8    0.8        0.9    0.6          0.5    0.5         0.5    0.5         0.7    0.2
18 - 24    0.5    0.5        0.8    0.8          0.8    0.8         0.5    0.8         0.4    0.4
25 - 44    0.5    0.5        0.8    0.8          0.8    0.8         0.5    0.8         0.6    0.5
45 - 64    0.8    0.8        0.5    0.6          0.8    0.8         0.5    0.8         0.3    0.8
65 - 69    0.9    0.9        0.5    0.6          0.5    0.5         0.8    0.8         0.4    0.4
70 - 74    0.9    0.9        0.5    0.5          0.5    0.5         0.8    0.8         0.2    0.7
75+        0.9    0.9        0.5    0.5          0.5    0.5         0.8    0.8         0.8    0.9

In the adjacency method, a sparse cell was collapsed with
the neighbor with the smallest poststratification factor. In
close-mean collapsing, a sparse cell was collapsed with the
nonsparse cell whose unweighted sample mean was closest
to that of the sparse cell.
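
The two neighbor rules can be sketched as below. The sketch assumes that cells are numbered row by row in a rectangular grid; with two rows and four columns this reproduces the neighbors of cell 3 named in the example above. The function names and the illustrative poststratification factors and sample means are hypothetical.

```python
def adjacent_cells(cell, n_rows, n_cols):
    """Cells horizontally or vertically adjacent to `cell` in a grid whose
    cells are numbered 1, 2, ... row by row (an assumed numbering)."""
    r, c = divmod(cell - 1, n_cols)
    out = []
    for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n_rows and 0 <= cc < n_cols:
            out.append(rr * n_cols + cc + 1)
    return sorted(out)


def adjacency_partner(cell, iaf, n_rows, n_cols):
    """Adjacent cell with the smallest poststratification factor."""
    return min(adjacent_cells(cell, n_rows, n_cols), key=lambda j: iaf[j])


def close_mean_partner(cell, ybar, nonsparse):
    """Nonsparse cell whose unweighted sample mean is closest to the
    sample mean of the sparse cell."""
    return min(nonsparse, key=lambda j: abs(ybar[j] - ybar[cell]))


neighbors_of_3 = adjacent_cells(3, n_rows=2, n_cols=4)
partner_adj = adjacency_partner(3, {2: 1.4, 4: 1.2, 7: 1.6}, 2, 4)
partner_mean = close_mean_partner(3, {2: 0.10, 3: 0.30, 4: 0.12, 7: 0.28},
                                  nonsparse=[2, 4, 7])
```

In this toy case the two rules choose different partners for cell 3: adjacency collapsing picks cell 4 (smallest factor), while close-mean collapsing picks cell 7 (sample mean 0.28 against 0.30). This is the kind of difference that drives the bias results reported later.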

Two different values of $f_{\max}$ were used in the
simulations: $f_{\max} = 2$ and 1.8. Use of $f_{\max} = 1.8$ leads to
more collapsing of cells than $f_{\max} = 2$ and exhibits more of
the biases (for the characteristics other than the artificial one
with a common mean) noted in sections 2 and 3 caused by
combining of cells with different means or different
coverage rates. The minimum cell size was set to
$n_{i,\min} = 29$. Of course, in practice many variations are used
to decide which combinations of cells are allowable. We 
have used just two of the possibilities for illustration in the 
simulation. 

Once all of the sparse cells and their neighbors are 
identified, the collapsing process proceeds sequentially from 
cell 1. In a survey with many potential poststrata defined in 
advance, these procedures might have to be performed 
iteratively to eliminate all sparse cells. In this simulation, we 
performed only one round of collapsing. 


4.5 Variance estimation 


For each of the estimators of a proportion, a linearization 
variance estimate was calculated. Each of the variance 
estimators is based on the linear substitute method (e.g., see 
Sarndal et al. 1992, chapter 5). The variance estimates for 
all estimators of proportions were computed using the 
svydesign, poststratify, and svymean functions 
in the R survey package. The general, theoretical 
approach is to make a linear approximation for a particular 
estimator. The linear approximation is rearranged so that the 
estimator is written as a sum of weighted PSU totals, and 
the variance estimator for with-replacement PSU sampling 
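is used. The estimators $\hat{\bar y}_{PS1}$, $\hat{\bar y}_{PS2}$, $\hat{\bar y}_{PS.WR1}$, and $\hat{\bar y}_{PS.WR2}$ are
treated as standard poststratified estimators for the purposes
of variance estimation. For $\hat{\bar y}_{PS1}$, define the following:

$u_k = (N_i/\hat N_i)(y_k - \hat{\bar y}_i)$, $k \in s_i$, with $\hat{\bar y}_i = \hat T_i/\hat N_i$ and $s_i$
being the set of all sample units in poststratum
i; $u_k$ is known as a linear substitute;

$\hat u_{hj} = \sum_{i,\,k \in s_{hj(i)}} w_k u_k$; and

$\bar u_h = \frac{1}{n_h}\sum_{j \in s_h} \hat u_{hj}$.

The variance estimator for $\hat{\bar y}_{PS1}$ is then

$$v(\hat{\bar y}_{PS1}) = \frac{1}{N^{2}}\sum_{h}\frac{n_h}{n_h - 1}\sum_{j \in s_h}\left(\hat u_{hj} - \bar u_h\right)^{2}. \tag{9}$$

For the collapsed stratum estimator, $\hat{\bar y}_{PS2}$, (9) applies with
the linear substitute defined as

$$u_k = \frac{N_g}{\hat N_g}\left(y_k - \hat{\bar y}_g\right), \quad k \in s_g,$$

with $\hat{\bar y}_g = \hat T_g/\hat N_g$, and $s_g$ is the set of all sample units in
group g.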
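
Expression (9) is straightforward to compute once the weighted PSU totals of the linear substitute are available. The sketch below is a plain-Python illustration of that formula, not the R survey package routine used in the simulation; the function name `wr_variance` and the toy inputs are assumptions.

```python
def wr_variance(u_hat, N):
    """With-replacement variance estimator of expression (9).

    u_hat : dict mapping each stratum h to the list of weighted PSU totals
            u_hat_hj of the linear substitute for the sample PSUs in h
    N     : population size
    """
    total = 0.0
    for psu_totals in u_hat.values():
        n_h = len(psu_totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two sample PSUs")
        u_bar = sum(psu_totals) / n_h
        # Between-PSU sum of squares within the stratum, scaled by n_h/(n_h-1).
        total += n_h / (n_h - 1) * sum((u - u_bar) ** 2 for u in psu_totals)
    return total / N ** 2


# Two strata with two sample PSUs each, as in the design described above.
v = wr_variance({1: [3.0, -1.0], 2: [2.0, 2.0]}, N=100)
```

Only the first stratum contributes here, since the two PSU totals in the second stratum are equal. The same function applies to PS2 once the PSU totals are built from the group-level linear substitute.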

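In the cases of PS.WR1 and PS.WR2, we calculate the
final weights as described in section 3 and call the R
poststratify function. This results in the linear
substitute being computed as

$$u_k = \left(y_k - \hat{\bar y}_g\right)\frac{N_g}{\hat N_g} = y_k - \hat{\bar y}_g$$

because $\hat N_g = \sum_{i \in A_g}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} \tilde w_k = N_g$, where $\tilde w_k$ here
denotes the final adjusted weight. The mean $\hat{\bar y}_g$ is
computed as

$$\hat{\bar y}_g = \sum_{i \in A_g}\sum_{h}\sum_{j \in s_h}\sum_{k \in s_{hj(i)}} \tilde w_k y_k / N_g.$$

The weighted linear substitute is then $\tilde u_k = \tilde w_k\,(y_k - \hat{\bar y}_g)$ and
$\hat u_{hj} = \sum_{i,\,k \in s_{hj(i)}} \tilde u_k$.

In the cases of PS2, PS.WR1, and PS.WR2, these
variance estimators do not account for the dynamic nature of
cell collapsing which can vary from sample to sample.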
Consequently, there is a source of variation that is not 
accounted for, and we can anticipate that the variance 
estimates will be somewhat too small compared to 
empirical, simulation variances. 


4.6 Simulation results 


Tables 4-7 summarize results for coverage correction 
errors, relative biases of estimated proportions, variances of 
alternative estimators, and confidence interval coverage 
using linearization variance estimators. Table 4 shows 
average absolute coverage correction error, defined as

$$\bar e = (DI)^{-1}\sum_{d=1}^{D}\sum_{i=1}^{I}\left|\hat N_{di}/N_i - 1\right| \tag{10}$$

where d is one of the D = 2,000 samples, I = 16 is the number of
initial poststrata, and $\hat N_{di}$ is the
estimated number of units in poststratum i based on the final
weights for a particular estimator (Hajek, PS2, PS.WR1, or
PS.WR2). The value of $\bar e$ is 0 for the poststratified
estimator with no cell collapsing, PS1, since it corrects
coverage error completely in each of the 16 poststrata. To
illustrate how the average coverage correction errors can
vary, we estimated the proportions for the health insurance
and common mean Y variable using the C1 and C5 frame
coverage ratios. For most combinations of coverage ratios,
collapsing method, and adjustment bound, PS.WR1 more
effectively corrects for coverage error than the standard
collapsing estimator, PS2. For example, $\bar e = 0.086$ with
PS.WR1 for (health insurance, adjacency collapsing,
$f_{\max} = 2$) while PS2 has $\bar e = 0.120$. In contrast, PS.WR2 is
somewhat worse than PS2 in coverage correction.
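
As a rough plausibility check on the Hajek column of Table 4, one can compute what the coverage correction error would be if the weights carried no coverage correction at all and sampling variability were ignored. The sketch assumes that, in that case, the ratio of estimated to true cell count is simply the coverage ratio of the cell, so that the error is the average of |coverage ratio - 1| over the 16 cells. That simplification is an assumption made for this illustration; the coverage ratios are copied from Table 3.

```python
# Coverage ratios by age group (eight rows) from Table 3.
C1 = {"male":   [0.9, 0.8, 0.5, 0.5, 0.8, 0.9, 0.9, 0.9],
      "female": [0.9, 0.8, 0.5, 0.5, 0.8, 0.9, 0.9, 0.9]}
C5 = {"male":   [0.9, 0.7, 0.4, 0.6, 0.3, 0.4, 0.2, 0.8],
      "female": [0.8, 0.2, 0.4, 0.5, 0.8, 0.4, 0.7, 0.9]}


def expected_abs_error(ratios):
    """Average of |coverage ratio - 1| over the 16 age x sex cells,
    i.e., the coverage correction error with no correction and no
    sampling variability (a simplifying assumption)."""
    cells = ratios["male"] + ratios["female"]
    return sum(abs(r - 1.0) for r in cells) / len(cells)


e_c1 = expected_abs_error(C1)
e_c5 = expected_abs_error(C5)
```

The results are about 0.225 for C1 and about 0.438 for C5. These sit slightly below the Hajek entries of 0.257 and 0.442 in Table 4; the gap is plausibly due to sampling variability in the estimated cell counts, which this sketch ignores.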

Table 5 presents the relative biases (relbias), defined as
$100\,D^{-1}\sum_{d}(\hat{\bar y}_d - \bar Y)/\bar Y$, where $\hat{\bar y}_d$ is one of the estimates of
proportion for sample d. The Hajek estimates are badly
biased for the first four characteristics since they include no 
correction for the differential undercoverage among the 
cells. The relbiases range from -12.1% for limitations to 
13.4% for hospital stay. As noted in section 2, the bias can 
be either positive or negative, depending on the correlation 
of coverage rates and cell means. 

Poststratification with no collapsing of cells (PS1) gives
nearly unbiased estimates while the alternatives - PS2,
PS.WR1, and PS.WR2 - all introduce a bias when using
adjacency collapsing for the first four characteristics. The
number of poststrata after collapsing, shown in Table 5,
ranges from 6 to 16 when $f_{\max} = 2$ and from 5 to 13 when
$f_{\max} = 1.8$. The relative biases of PS2, using adjacency
collapsing, range from -4.4% to 6.2% when $f_{\max} = 2$ and
from -6.5% to 9.4% when $f_{\max} = 1.8$. With adjacency
collapsing, the alternatives, PS.WR1 and PS.WR2, have
biases that are intermediate between PS1 (no collapsing)
and PS2. PS.WR1, in particular, is reasonably competitive
with PS1 in terms of bias with adjacency collapsing. In
contrast, close-mean collapsing yields PS2, PS.WR1, and
PS.WR2 estimates that are essentially unbiased when
$f_{\max} = 2$. With close-mean collapsing and $f_{\max} = 1.8$, PS2 and
PS.WR2 are still somewhat biased, but PS.WR1 compares
well with PS1. For the fifth characteristic (Common mean
Y), all estimators are nearly unbiased, regardless of
collapsing method, as expected.

One justification that is conventionally given for 
collapsing cells is that extreme weights will be reduced and 
variances of estimates will, in turn, be reduced. Table 6 
shows the ratios of the empirical variances of estimated
proportions to the variance of PS1. The
Hajek estimates have variances that are about 12% and 18%
smaller than those of PS1 for health insurance and
limitations, but are more variable than PS1 for delayed care
and hospital stay. These results also make it clear that the
variance of a poststratified estimator can be either increased
or decreased by collapsing. There are some minor variance
gains from using PS2 for some combinations for the first
four variables, but with (adjacency, $f_{\max} = 2$) the PS2
variance of hospital stay is 17% larger than that of PS1.
With (adjacency, $f_{\max} = 1.8$), PS2 is 23% more variable for
hospital stay. PS.WR1 does not have the extreme variances
of PS2 in adjacency collapsing; like PS2, PS.WR2 has
larger variance for hospital stay in adjacency collapsing.
When close-mean, rather than adjacency, collapsing is used,
variances of PS2, PS.WR1, and PS.WR2 are much closer to
those of PS1. However, for the Common Mean Y variable,
collapsing always reduces variance. The reductions are
almost 20% for adjacency collapsing.

The right-hand section of Table 6 lists the ratios of the
empirical mean square errors (MSEs) of estimated
proportions to the MSE of PS1. With a few
exceptions, PS2 is the worst choice of the poststratified
estimators for the first four characteristics regardless of the
combination of variable, $f_{\max}$, and collapsing method.
When $f_{\max} = 1.8$, the choice that leads to more collapsing,
the MSEs of PS2 range from 1.8% to 44.2% larger than
those of PS1. The MSEs of both PS.WR1 and PS.WR2 are
near those of PS1 with the exception of (hospital stay,
$f_{\max} = 1.8$, adjacency) where the 6.3% bias of PS.WR2
leads to an MSE 25.6% larger than that of PS1. Close-mean
collapsing is preferable to adjacency collapsing, although
for the first four characteristics none of the estimators has
a smaller MSE than PS1, which does not use collapsing.

The estimators again perform differently for the 
Common mean Y variable. The MSEs of Hajek, PS2, 
PS.WR1, and PS.WR2 are all less than that of PS1. The 
Hajek estimator has the smallest MSE, owing to the fact that 
poststratification is unnecessary to correct bias in estimating 
the mean. 


Table 4 Average absolute coverage correction error as defined in expression (10)

Collapsing   Adjustment   Hajek    PS2           PS.WR1              PS.WR2
method       bound                 (standard     (truncate weights   (fixed maximum
                                   collapsing)   then collapse)      weight adjustment)
C1 coverage ratios for health insurance variable
Adjacency    2            0.257    0.120         0.086               0.221
Close mean   2            0.257    0.080         0.127               0.281
Adjacency    1.8          0.256    0.150         0.085               0.202
Close mean   1.8          0.256    0.101         0.109               0.258
C5 coverage ratios for common mean Y variable
Adjacency    2            0.442    0.326         0.196               0.331
Close mean   2            0.441    0.321         0.203               0.370
Adjacency    1.8          0.442    0.330         0.206               0.376
Close mean   1.8          0.442    0.337         0.214               0.446

Table 5 Relative biases (in percent) of estimated proportions. (Figures for Hajek and PS1 are not affected by
collapsing and are repeated in the four sections of the table to facilitate comparisons)


Characteristic Range of Hajek PS1 PS2 PS.WR1 PS.WR2
no. of (no (standard (truncate weights (fixed maximum
poststrata after collapsing) collapsing) then collapse) weight adjustment)
collapsing
Adjacency collapsing, adjustment bound = 2 
Health insurance (10, 16) -11.5 0.1 -4.4 1.0 -1.4 
Limitations (8, 15) -12.1 -0.3 -2.0 0.1 -1.0 
Delayed care (6, 14) 8.2 -0.2 Dee -0.6 0.9 
Hospital stay (9, 16) 13.4 0.2 6.2 -0.7 2.8
Common mean Y (OAL) 0.3 0 0.4 0.4 0.6 
Close-mean collapsing, adjustment bound = 2 
Health insurance (10, 16) -11.5 0.1 -0.5 0.5 -0.3 
Limitations (8, 15) -12.1 -0.3 -1.2 0.2 -1.1 
Delayed care (6, 14) 8.2 -0.2 -0.3 -0.3 -0.2 
Hospital stay (9, 16) 13.4 0.2 0.4 0.1 0.4 
Common mean Y (6, 11) 0.3 0 0.2 0.1 0.2 
Adjacency collapsing, adjustment bound = 1.8 
Health insurance (7, 13) -11.5 0.1 -6.5 Oe -3.5 
Limitations (7, 12) -12.1 -0.3 -3.4 0.3 -2.0 
Delayed care ‘oa 8.2 -0.2 30 -0.4 his 
Hospital stay (5, 12) 13.4 0.2 9.4 0.0 6.3
Common mean Y Ors) 0.3 0.1 0.5 0.6 0.6 
Close-mean collapsing, adjustment bound = 1.8 
Health insurance (6, 13) -11.5 0.1 -1.6 0.3 -1.7 
Limitations (7, 12) -12.1 -0.3 -2.7 0.9 -2.4
Delayed care (5, 10) 8.2 -0.2 0.2 -0.3 0.5
Hospital stay (5, 12) 13.4 0.2 LS 0.3 2.0 
Common mean Y (5, 10) 0.3 0.2 0.3 0.3 0.3 


Table 6 Ratio of variances (or MSEs) to the variance (or MSE) of the poststratified estimator (PS1) with no
collapsing. (Figures for Hajek are repeated in the four sections of the table to facilitate comparisons)

PS2 = standard collapsing; PS.WR1 = truncate weights then collapse; PS.WR2 = fixed maximum weight adjustment

                   Ratio of variances to the variance of PS1    Ratio of MSEs to the MSE of PS1
Characteristic     Hajek    PS2      PS.WR1   PS.WR2            Hajek    PS2      PS.WR1   PS.WR2
Adjacency collapsing, adjustment bound = 2
Health insurance   0.877    1.014    1.025    0.991             1.500    1.101    1.018    1.006
Limitations        0.821    0.966    1.035    0.977             1.555    1.008    1.017    0.992
Delayed care       1.099    1.023    1.003    1.000             1.239    1.023    1.000    1.000
Hospital stay      1.290    1.169    1.000    1.073             1.733    1.244    1.000    1.070
Common mean Y      0.755    0.805    0.908    0.818             0.752    0.801    0.904    0.826
Close-mean collapsing, adjustment bound = 2
Health insurance   0.877    1.013    1.014    1.008             1.500    1.006    1.006    1.006
Limitations        0.821    0.999    1.025    0.994             1.555    1.008    1.017    1.008
Delayed care       1.099    0.997    0.998    1.001             1.239    1.000    1.000    1.000
Hospital stay      1.290    1.011    1.000    1.008             1.733    1.012    1.000    1.012
Common mean Y      0.776    0.935    0.974    0.902             0.781    0.933    0.973    0.906
Adjacency collapsing, adjustment bound = 1.8
Health insurance   0.877    0.960    1.044    0.976             1.500    1.179    1.024    1.048
Limitations        0.821    0.939    1.032    0.961             1.555    1.034    1.017    1.000
Delayed care       1.099    1.051    0.991    1.032             1.239    1.057    0.989    1.034
Hospital stay      1.290    1.223    1.043    1.201             1.733    1.442    1.023    1.256
Common mean Y      0.780    0.815    0.882    0.828             0.779    0.816    0.893    0.829
Close-mean collapsing, adjustment bound = 1.8
Health insurance   0.877    1.010    1.006    1.019             1.500    1.018    1.000    1.024
Limitations        0.821    0.983    1.051    0.975             1.555    1.034    1.034    1.017
Delayed care       1.099    1.003    0.995    1.001             1.239    1.000    1.000    1.000
Hospital stay      1.290    1.052    1.001    1.059             1.733    1.035    1.000    1.047
Common mean Y      0.771    0.924    0.958    0.876             0.778    0.932    0.959    0.879

Table 7 Coverage rates in percent of 95% confidence intervals computed using t-distribution with 25 DF.
(Figures for Hajek and PS1 are not affected by collapsing and are repeated in the four sections of the
table to facilitate comparisons)


Characteristic Hajek PS1 PS2 PS.WR1 PS.WR2
(no collapsing) (standard (truncate weights (fixed maximum
collapsing) then collapse) weight adjustment)
Adjacency collapsing, adjustment bound = 2 
Health insurance 75.9 93.8 90.1 94.4 93.6
Limitations 70.9 94.5 93.1 94.5 93.9
Delayed care 92.0 94.0 94.5 94.0 94.6 
Hospital stay 82.2 94.5 91.8 94.3 93.9 
Common mean Y 94.8 93.8 94.6 94.4 94.5 
Close-mean collapsing, adjustment bound = 2 
Health insurance ISD 93.8 O57 94.2 94.0 
Limitations 70.9 94.5 93.0 94.3 93.5 
Delayed care 92.0 94.0 93.6 93.9 93.8 
Hospital stay 82.2 94.5 94.3 94.7 94.6 
Common mean Y O34) 92.9 92.2 92. 93.2, 
Adjacency collapsing, adjustment bound = 1.8 
Health insurance 75.9 93.8 87.5 94.1 92.4
Limitations 70.9 94.5 92.0 94.2 93.3 
Delayed care 92.0 94.0 94.8 94.5 94.5 
Hospital stay 82.2 94.5 88.4 94.0 91.1 
Common mean Y 94.8 94.1 94.3 94.8 94.7 
Close-mean collapsing, adjustment bound = 1.8 
Health insurance 75.9 93.8 92.8 94.3 93.3
Limitations 70.9 94.5 92> 94.5 93.0 
Delayed care 92.0 94.0 93.8 94.0 94.4 
Hospital stay 82.2 94.5 93.8 94.6 93.8 
Common mean Y 94.9 94.5 93.6 93.8 94.8 


Table 7 reports the empirical coverages of 95% CI’s 
computed using the estimated proportions and the 
linearization variance estimator that naturally accompanies 
each. A t-distribution with 25 degrees of freedom is used in 
all cases. The Hajek coverage rates are extremely poor, as 
expected, ranging from 70.9% to 92% for the first four 
characteristics. The poststratified estimators, PS1 and 
PS.WR1, provide 93.8% to 94.7% coverage, i.e., near the 
nominal 95%. In contrast, PS2 coverage is somewhat poor 
for Health insurance and Hospital stay, especially for 
adjacency collapsing with adjustment bound 1.8, where the 
coverages are 87.5% and 88.4%. Coverage rates for PS.WR2 
are slightly less than for PS.WR1 but are reasonably close 
to nominal. Use of close-mean collapsing generally improves 
the cases of poor coverage found with adjacency. For 
Common Mean Y coverages are good, ranging from 92.2% to 94.9%. 
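
The coverage computation described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the replicate values are hypothetical, and the constant 2.0595 is the 0.975 quantile of a t-distribution with 25 degrees of freedom, hard-coded to avoid any library dependency.

```python
# 0.975 quantile of Student's t with 25 degrees of freedom (hard-coded constant)
T_975_25DF = 2.0595


def covers(estimate, variance, truth, t_quantile=T_975_25DF):
    """True if the interval estimate +/- t * sqrt(variance) contains the true value."""
    half_width = t_quantile * variance ** 0.5
    return estimate - half_width <= truth <= estimate + half_width


def coverage_rate(estimates, variances, truth):
    """Percentage of replicates whose confidence interval covers the true value."""
    hits = sum(covers(e, v, truth) for e, v in zip(estimates, variances))
    return 100.0 * hits / len(estimates)


# Hypothetical replicates: estimated proportions and their estimated variances
estimates = [0.50, 0.55, 0.40, 0.62]
variances = [0.0009, 0.0009, 0.0009, 0.0009]
rate = coverage_rate(estimates, variances, truth=0.50)
```

In a simulation the two lists would hold one entry per replicate, with each variance taken from the linearization estimator that accompanies the point estimator.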

In summary, the weight-restricted estimators, PS.WR1 
and PS.WR2, have some advantage over the standard 
collapsing estimator, PS2. They are generally less biased 
and retain more of the undercoverage adjustment than does 
PS2. However, the most critical element in bias-control is 
how the cells are collapsed in the first place. Collapsing 
using nearness of cell means or coverage rates is far 
preferable to collapsing using some adjacency criterion 
based on neither of these. Only when cell means were equal 
did we observe any gain in MSE from collapsing cells. 
However, equality of cell means is the exception in practice. 


5. Concluding remarks 


Designers of surveys of households or establishments 
often have a lengthy list of poststrata or cells in mind when 
they develop weighting systems. If the sample size in a 
poststratum is small or the sample estimate of the population 
count in a poststratum is much different from an external 
control count, the poststratum may be collapsed with an 
adjacent one. The conventional justification for collapsing is 
that the possibility of creating extreme weights is reduced as 
are variances of estimates. 

However, a poor choice of the method for collapsing has 
at least two undesirable consequences: (i) deficient frame or 
sample coverage in some cells is not completely corrected 
and (ii) estimates from the standard approach to collapsing 
may be quite biased. The latter problem can result in 
confidence intervals that cover at much less than the 
nominal rate. Collapsing leads to bias when coverage rates, 
cell means, or both are correlated within a collapsed 
poststratum. The bias can be either positive or negative, 
depending on the correlation. 

Cells should be collapsed based on similarity of coverage 
rates, population cell means, or both in order to avoid bias. 
This method of collapsing can be much different from 
standard procedures that only collapse “adjacent” cells, e.g., 
by combining contiguous age groups. If the adjacency 
coincides with cells that have similar coverage rates or 
means, no bias results. But this should be checked rather 
than assumed. 
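
The condition for bias-free collapsing can be illustrated with a small numerical sketch. It rests on a simplifying assumption that is not taken from the paper: within a collapsed poststratum the sample mean estimates the mean of the covered population, and the estimate is then scaled to the combined control count. The function name and the numbers are hypothetical.

```python
def collapsed_bias(N, mu, c):
    """Approximate bias of the estimated total after collapsing all cells into one.

    N  : population counts of the original cells (control totals)
    mu : population means of the cells
    c  : coverage rates of the cells (share of each cell reached by the frame)
    Assumption: the collapsed-cell sample mean estimates the covered-population mean.
    """
    true_total = sum(n * m for n, m in zip(N, mu))
    covered = [ci * ni for ci, ni in zip(c, N)]
    covered_mean = sum(w * m for w, m in zip(covered, mu)) / sum(covered)
    return sum(N) * covered_mean - true_total


# Different means and different coverage rates: collapsing is biased
bias_mixed = collapsed_bias([100, 100], [0.2, 0.6], [0.9, 0.5])
# Equal means or equal coverage rates: no bias under this approximation
bias_equal_means = collapsed_bias([100, 100], [0.4, 0.4], [0.9, 0.5])
bias_equal_coverage = collapsed_bias([100, 100], [0.2, 0.6], [0.7, 0.7])
```

Under this approximation the bias vanishes as soon as either the cell means or the coverage rates agree, which is the similarity criterion recommended above.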

There are at least two practical issues with collapsing 
based on cell means. One is that, while the theory directs us 
to collapse based on population means, in a particular 
sample we will only have estimates for the population 
covered by the frame. Coverage may be so deficient that the 
means of the covered and non-covered parts of the 
population are substantially different, even within the initial 
poststrata. This would be a case of “nonignorable non- 
coverage.” If so, poststratification based only on the initial 
set of cells or combinations of them cannot correct coverage 
bias. A second practical issue is that data on many items are 
collected in most surveys. Collapsing based on the cell 
means for one variable may not work well for other 
variables. In that case, the compromise, suggested by Little 
and Vartivarian (2005) for nonresponse adjustment, of 
collapsing based on some weighted average of the means of 
an important set of variables should be a good solution. 

Extensions of this research would be to examine the 
performance of the class of calibration estimators in 
correcting coverage errors. Poststratification is a special 
case. When categories of qualitative auxiliaries are 
combined due to small sample sizes or other reasons, the 
same bias problems we have illustrated here may be intro- 
duced in more general calibration estimators. One method 
of allowing some flexibility to depart from controls while 
retaining important auxiliaries is already available in Rao 
and Singh (1997). The effect of their proposals on coverage 
bias needs to be investigated. 
A single frame multiplicity estimator for multiple frame surveys 


Fulvia Mecatti ¹ 


Abstract 


Multiple Frame Surveys were originally proposed to foster cost savings on the basis of an optimality approach. As surveys 
on special, rare and difficult-to-sample populations are becoming more prominent, a single list of population units to be 
used as a sampling frame is often unavailable in sampling practice. In recent literature multiple frame designs have been put 
forward in order to increase population coverage, to improve response rates and to capture differences and subgroups. 
Alternative approaches to multiple frame estimation have appeared, all of them relying upon the virtual partition of the set 
of the available overlapping frames into disjoint domains. Hence the correct classification of sampled units into the 
domains is required for practical applications. In this paper a multiple frame estimator is proposed using a multiplicity 
approach. Multiplicity estimators require less information about unit domain membership; hence they are insensitive to 
misclassification. Moreover, the proposed estimator is analytically simple, so that it is easy to implement, and its exact 
variance is given. Empirical results from an extensive simulation study comparing the multiplicity estimator with major 
competitors are also provided. 


Key Words: Difficult-to-Sample populations; Dual frame survey; Misclassification; Raking ratio; Variance estimation. 


1. Introduction 


In classic finite population sampling a basic hypothesis is 
the availability of a unique and complete list of units 
forming the target population to be used as a sampling 
frame. In some cases a set of two or more lists is available 
for survey purposes. The general case of $Q \ge 2$ lists, 
each incomplete on its own and possibly overlapping, is known as a 
Multiple Frame Survey. Multiple frame surveys were 
originally introduced (Hartley 1974) as a device for 
reducing survey costs by achieving the same precision as a 
customary unique-frame survey. In modern sampling 
practice, as surveys of special, rare and difficult-to-sample 
populations are becoming more common (Kalton and 
Anderson 1986; Sudman and Kalton 1986; Sudman, Sirken 
and Cowan 1988) it is often the case that a unique list of 
units does not exist and the population size N is an 
unknown parameter to be estimated. Recent literature 
considers multiple frame surveys with the main aim of 
increasing population coverage, of improving response rates 
and of capturing differences and subgroups more accurately 
(Iachan and Dennis 1993; Carlson and Hall 1994; Haines 
and Pollock 1998; Eurostat 2000). In a recent paper Lohr 
and Rao (2006) stated: “As the U.S., Canada, and other 
nations grow in diversity, different sampling frames may 
better capture subgroups of the population. [...] We 
anticipate that modular sampling designs using multiple 
frames will be widely used in the future”. A contemporary 
application could be found in web surveys: the population 
coverage can be improved and the bias due to the features of 
the site used for data collection can be reduced by using two 
or more independent web sites simultaneously. Since the 
same unit can visit more than one site involved in the 
survey, the sites overlap, giving rise to a multiple frame 
setting. 

Estimation in multiple frame surveys, as first developed 
by Hartley (1962, 1974), is based on the virtual partition of 
the population (i.e., the unknown union of the $Q$ 
overlapping frames) into $2^Q - 1$ disjoint domains (i.e., the 
mutually exclusive intersections of frames). Hence the total 
Y of a study variable y, taken as the parameter to be 
estimated, is expressed as a sum of domain totals. Sample 
data from the Q frames are used to produce estimates for 
the domain totals. Estimated domain totals are finally 
combined to provide estimation for the population total Y. 
A number of estimators have been developed according to 
alternative approaches to multiple frame estimation (see 
Section 2). Since all estimators appearing in literature rely 
on the partition into the domains as mentioned above, the 
correct identification of the domain membership of each 
sampled unit is required for their practical application. This 
is a strong assumption that may not always be true in 
practice, as argued for instance in Lohr and Rao (2006). 
Indeed this implies that every sampled unit should be 
questioned on both the survey value and on its membership 
to each frame involved in the survey, in order to be able to 
correctly classify them into the domains. In addition to the 
natural risk of misclassification there might also be a risk 
connected with confidentiality and with the sensitivity of 
units to the frame membership which could both increase 
the rate of non-response and affect the estimator precision. 
This situation could apply, for instance, when surveying 
sensitive characteristics (private behaviours, addictions ...) 
or when sampling elusive populations (illegal immigrants, 
ex-prisoners, patients...). In the present paper a different 
approach to estimation in multiple frame surveys is adopted. 
The concept of unit multiplicity, corresponding to the 
number of frames to which units belong, is proposed as an 
alternative to the existing approaches based on the domain 
membership, i.e., to which frames units belong. An unbiased 
estimator, naturally insensitive to domain misclassification 
and applying to any number of frames, is presented. The 
proposed multiplicity estimator has a simple analytical 
structure so that it can be easily implemented, while its exact 
variance is given in a closed form and hence readily 
estimated for any sample size. 

In Section 2 an overall discussion of the main contri- 
butions to multiple frame estimation is presented in a 
unified view and the necessary notation is introduced. In 
Section 3 a multiplicity estimator is proposed and variance 
estimation is analysed. An extensive simulation study 
comparing the proposed estimator with major competitors is 
presented in Section 4. 


2. Optimum, pseudo-optimum and 
single frame estimation 


Although the literature has mostly dealt with the dual frame 
case ($Q = 2$), a general theoretical framework for multiple 
frame surveys ($Q \ge 2$) has been recently provided in Lohr 
and Rao (2006). By using their multiple frame notation, 
different estimation approaches are briefly reviewed and the 
available estimators are presented in a unified way which 
highlights their dependency on the domain membership of 
the sampled units. 

Let $A_1, \ldots, A_q, \ldots, A_Q$ be a collection of $Q \ge 2$ overlapping 
frames, the union of which offers a population coverage 
adequate for survey objectives. Let the index sets $K$ be the 
subsets of the range of the frame index $q = 1, \ldots, Q$. For 
every index set $K \subseteq \{1, \ldots, q, \ldots, Q\}$ a domain is defined as 
the set $D_K = (\bigcap_{q \in K} A_q) \cap (\bigcap_{q \notin K} A_q^c)$, where $^c$ denotes 
complementation. 

Let the domain membership indicator be the indicator 
$\delta_i(K)$ taking value 1 if unit $i$ is included in domain $D_K$ 
and 0 otherwise. The total $Y$ to be estimated over the (unknown) 
union of the $Q$ overlapping frames is then expressed as a 
sum over the set of $2^Q - 1$ disjoint domains 

$$Y = \sum_{i \in \bigcup_q A_q} y_i = \sum_K \sum_{i \in \bigcup_q A_q} \delta_i(K) \, y_i. \qquad (1)$$
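
The partition into domains can be made concrete with a short sketch. The toy population, the unit labels and the helper names are hypothetical and serve only to illustrate decomposition (1) for three frames.

```python
from itertools import combinations


def domain_index_sets(Q):
    """All non-empty index sets K within {1, ..., Q}; there are 2**Q - 1 of them."""
    frames = range(1, Q + 1)
    return [frozenset(c) for r in frames for c in combinations(frames, r)]


def domain_of(unit_frames):
    """Index set K of the single domain D_K that contains a unit."""
    return frozenset(unit_frames)


# Hypothetical toy population: unit label -> (y value, frames containing the unit)
population = {
    "a": (10.0, {1}),
    "b": (4.0, {1, 2}),
    "c": (7.0, {2, 3}),
    "d": (1.0, {1, 2, 3}),
    "e": (3.0, {3}),
}

domains = domain_index_sets(3)
# delta_i(K) equals 1 exactly when K is the set of frames containing unit i
domain_totals = {
    K: sum(y for y, frames in population.values() if domain_of(frames) == K)
    for K in domains
}
Y = sum(y for y, _ in population.values())
```

Since every unit falls in exactly one domain, the domain totals add up to the population total Y.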


Let $s_q$ be a sample selected from frame $A_q$ under a 
given design, independently for $q = 1, \ldots, Q$. A general 
expression of a multiple frame estimator based on the 
domain classification is then 

$$\hat{Y} = \sum_K \sum_{q \in K} \sum_{i \in s_q} w_i^{(q)} \, \delta_i(K) \, y_i. \qquad (2)$$

Note that when an unbiased estimator for the total $Y$ is 
given, an estimator for the population size $N$ is also given 
by simply substituting sample values $y_i$ by 1’s. 

Estimators available in the literature result from setting 
weights $w_i^{(q)}$ in (2) according to three main approaches. 
Since multiple frame surveys were originally put forward 
with the aim of fostering cost savings by achieving equal or 
greater precision than a customary unique-frame survey, an 
optimum approach was first suggested by using optimum 
weights $w_i^{(q)}$ in (2), i.e., by minimizing the estimator 
variance (Hartley 1962, 1974; Lund 1968; Fuller and 
Burmeister 1972). Optimum estimators have optimal 
theoretical properties (Skinner 1991; Lohr and Rao 2000) 
but present practical problems due mainly to their 
complexity (explicit though complex formulae for optimum 
weights with any number of frames are given in Lohr 
and Rao 2006, Section 3). Moreover, optimum weights 
depend on unknown population covariances so that they 
must be estimated from sample data. This is computationally 
complex and also affects optimality, since the extra 
variability in estimating the covariances leads to larger mean 
square errors (Lohr and Rao 2006, Section 7). 

In order to improve the applicability, a single frame (SF) 
approach has been proposed by using fixed weights which 
ensure design-unbiasedness. For simple random sampling in 
every frame, the SF estimator is given by substituting 
weights $w_i^{(q)}$ in (2) with $w_{i \in D_K}^{(q)} = w_K^{(q)} = (\sum_{q \in K} f_q)^{-1}$, where 
$f_q = n_q / N_q$ denotes the frame sampling fraction (Bankier 
1986; Kalton and Anderson 1986; Skinner 1991; Skinner, 
Holmes and Holt 1994). Since fixed weights usually differ 
from optimum weights, the SF estimator is generally less 
efficient than an optimum estimator (Lohr and Rao 2000). 
Finally a pseudo-optimum approach was proposed (Skinner 
and Rao 1996; Lohr and Rao 2000) in order both to achieve 
wider applicability than optimum estimators and to 
improve efficiency compared with the SF approach. A 
pseudo-maximum likelihood (PML) estimator for multiple 
frame surveys is given by substituting in (2): 
$w_{i \in D_K}^{(q)} = w_K^{(q)} = \hat{N}_K / \sum_{q \in K} \sum_{i \in s_q} \delta_i(K) = \hat{N}_K / n_K$, where the estimated 
domain sizes $\hat{N}_K$ are the solution of a system of 
nonlinear equations. Although complex to implement for 
practical applications (an iterative linear approximation of 
$\hat{N}_K$ under simple random sampling is given in Lohr and 
Rao 2006, Section 4.1) the PML estimator retains good 
theoretical properties from the optimum approach. 
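
As an illustration of the SF weighting under simple random sampling, the sketch below weights each sampled value by the inverse of the summed sampling fractions of the frames in its domain. The function name, the data layout and the toy values are hypothetical; the sketch also assumes that a sample is available from every frame appearing in a unit's domain.

```python
def sf_estimate(samples, frame_sizes):
    """Single frame (SF) estimate of the total under simple random sampling.

    samples[q]     : list of (y_i, K_i) pairs sampled from frame q, where K_i is the
                     set of frames containing unit i (its domain index set)
    frame_sizes[q] : known frame size N_q
    """
    f = {q: len(s) / frame_sizes[q] for q, s in samples.items()}
    total = 0.0
    for q, sample in samples.items():
        for y, K in sample:
            # weight = inverse of the sum of the sampling fractions over the frames in K
            total += y / sum(f[k] for k in K)
    return total


# Hypothetical toy frames; with n_q = N_q (a census) the estimator returns Y = 25
frames = {
    1: [(10.0, {1}), (4.0, {1, 2}), (1.0, {1, 2, 3})],
    2: [(4.0, {1, 2}), (7.0, {2, 3}), (1.0, {1, 2, 3})],
    3: [(7.0, {2, 3}), (1.0, {1, 2, 3}), (3.0, {3})],
}
frame_sizes = {q: len(units) for q, units in frames.items()}
census_value = sf_estimate(frames, frame_sizes)
```

Note that the weight requires the full domain index set of every sampled unit, which is where the sensitivity to misclassification enters.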

Note that formula (2) involves the domain membership 
indicator $\delta_i(K)$; hence optimum, pseudo-optimum and SF 
estimators apply only if the correct classification of sample 
data into the $2^Q - 1$ domains is accomplished. 
In the next Section a multiple frame estimator is 
presented on the basis of a single frame multiplicity approach 
which does not require domain classification. 


3. The single frame multiplicity estimator 


The notion of multiplicity was first introduced in 
connection with Network Sampling (Casady and Sirken 
1980; Sirken 2004). It is also a tool of the Generalized 
Weight Share Method (Lavallée 2002; 2007) as well as of 
the Center Sampling estimation theory (Mecatti 2004) since 
center sampling and multiple frame surveys are equivalent 
under certain conditions. In Lohr and Rao (2006), the 
multiplicity of domain $D_K$ is defined as the cardinality of 
the index set K. Since domains are mutually exclusive, 
multiplicity is also a characteristic of every population unit, 
being the number of frames in which each unit is included 
among the Q involved in the survey. 

Let $m_i$ be the multiplicity of unit $i$. Note that unit 
multiplicity may be collected simply by asking sampled 
units how many frames they belong to. 

Since clearly $\sum_{q=1}^{Q} \sum_{i \in A_q} y_i = \sum_{i \in \bigcup_q A_q} y_i m_i$, it follows that 

$$Y = \sum_{q=1}^{Q} \sum_{i \in A_q} y_i \, m_i^{-1}. \qquad (3)$$


Notice that expression (3), which involves exclusively 
sums over the frames, represents a practical advantage with 
respect to equation (1). In fact the domains provide a virtual 
(unknown) partition of the population while the sample 
selection is actually performed in the $Q$ overlapping 
frames. This leads to a SF multiplicity estimator as given by 

$$\hat{Y}_M = \sum_{q=1}^{Q} \sum_{i \in s_q} w_i^{(q)} \, y_i \, m_i^{-1} \qquad (4)$$

with fixed weights $w_i^{(q)}$ ensuring, for instance, design-unbiasedness. 
For simple random sampling of every frame 
we have $w_i^{(q)} = f_q^{-1} = N_q / n_q$. 
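
A minimal sketch of estimator (4) under simple random sampling is given below. The function name and the toy data are hypothetical; each sampled unit contributes its value divided by its reported multiplicity, inflated by the inverse sampling fraction of the frame it was drawn from.

```python
def multiplicity_estimate(samples, frame_sizes):
    """Multiplicity estimate of the total under simple random sampling of every frame.

    samples[q]     : list of (y_i, m_i) pairs sampled from frame q, where m_i is the
                     number of frames unit i belongs to
    frame_sizes[q] : known frame size N_q
    """
    total = 0.0
    for q, sample in samples.items():
        f_q = len(sample) / frame_sizes[q]  # sampling fraction n_q / N_q
        total += sum(y / m for y, m in sample) / f_q
    return total


# Hypothetical toy frames of three units each, listing (y_i, m_i); population total 25
frames = {
    1: [(10.0, 1), (4.0, 2), (1.0, 3)],
    2: [(4.0, 2), (7.0, 2), (1.0, 3)],
    3: [(7.0, 2), (1.0, 3), (3.0, 1)],
}
frame_sizes = {q: len(units) for q, units in frames.items()}

census_value = multiplicity_estimate(frames, frame_sizes)
one_sample = {1: frames[1][:2], 2: frames[2][1:], 3: [frames[3][2], frames[3][0]]}
sample_value = multiplicity_estimate(one_sample, frame_sizes)
```

Only the multiplicity of each sampled unit is needed, not the identity of the frames it belongs to.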

Unlike the optimum, PML and SF estimators discussed 
in Section 2, estimator (4) does not involve the domain 
membership indicator and it is very simple to implement in 
practical applications. Furthermore, for 
simple random sampling of every frame, the sampled values 
in multiplicity estimator (4) are weighted by $(f_q m_i)^{-1}$, i.e., 
by a specific frame coefficient; in the SF estimator, by contrast, 
sampled values are weighted by $w_K^{(q)} = (\sum_{q \in K} f_q)^{-1}$, 
i.e., by an average coefficient over the frames involved in 
each domain. As a consequence $\hat{Y}_M$ is expected to be more 
accurate than the SF estimator, as confirmed by simulation 
results. Moreover, owing to its Horvitz-Thompson structure, 
the exact variance of $\hat{Y}_M$ can be derived in closed form. For 
simple random sampling of every frame the estimator 
variance is given by 

$$V(\hat{Y}_M) = \sum_{q=1}^{Q} \frac{N_q (N_q - n_q)}{n_q (N_q - 1)} \left[ \sum_{i \in A_q} y_i^2 m_i^{-2} - \frac{1}{N_q} \Big( \sum_{i \in A_q} y_i m_i^{-1} \Big)^2 \right]. \qquad (5)$$

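An unbiased variance estimator for simple random 
sampling of every frame is then 

$$v(\hat{Y}_M) = \sum_{q=1}^{Q} \frac{N_q (N_q - n_q)}{n_q (n_q - 1)} \left[ \sum_{i \in s_q} y_i^2 m_i^{-2} - \frac{1}{n_q} \Big( \sum_{i \in s_q} y_i m_i^{-1} \Big)^2 \right]. \qquad (6)$$
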
The performance of the multiplicity estimator for finite 
sample sizes has been empirically studied under simple 
random sampling and compared to major competitors in a 
simulation study. 
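
The variance of the multiplicity estimator and its unbiased estimator can be checked numerically on a toy example. The sketch below assumes simple random sampling without replacement, drawn independently in every frame; all names and data are hypothetical. It enumerates every possible combination of samples, so the design variance is computed exactly rather than simulated.

```python
from itertools import combinations, product


def estimate(samples, frame_sizes):
    """Multiplicity estimate: sum over frames of (N_q / n_q) * sum of y_i / m_i."""
    return sum(frame_sizes[q] / len(s) * sum(y / m for y, m in s)
               for q, s in samples.items())


def exact_variance(frames, n):
    """Closed-form variance under SRS without replacement, independent across frames."""
    total = 0.0
    for q, units in frames.items():
        N_q, n_q = len(units), n[q]
        z = [y / m for y, m in units]
        spread = sum(v * v for v in z) - sum(z) ** 2 / N_q
        total += N_q * (N_q - n_q) / (n_q * (N_q - 1)) * spread
    return total


def variance_estimate(samples, frame_sizes):
    """Sample-based counterpart of exact_variance (requires n_q >= 2 in every frame)."""
    total = 0.0
    for q, s in samples.items():
        N_q, n_q = frame_sizes[q], len(s)
        z = [y / m for y, m in s]
        spread = sum(v * v for v in z) - sum(z) ** 2 / n_q
        total += N_q * (N_q - n_q) / (n_q * (n_q - 1)) * spread
    return total


# Hypothetical toy frames listing (y_i, m_i); the population total is Y = 25
frames = {
    1: [(10.0, 1), (4.0, 2), (1.0, 3)],
    2: [(4.0, 2), (7.0, 2), (1.0, 3)],
    3: [(7.0, 2), (1.0, 3), (3.0, 1)],
}
n = {1: 2, 2: 2, 3: 2}
sizes = {q: len(units) for q, units in frames.items()}
Y = 25.0

# Every combination of per-frame samples is equally likely under independent SRS
all_samples = [dict(zip(frames, combo))
               for combo in product(*(combinations(frames[q], n[q]) for q in frames))]
estimates = [estimate(s, sizes) for s in all_samples]
mean_estimate = sum(estimates) / len(estimates)
design_variance = sum((e - Y) ** 2 for e in estimates) / len(estimates)
mean_variance_estimate = sum(variance_estimate(s, sizes) for s in all_samples) / len(all_samples)
```

On this toy example the average of the estimates equals Y, the design variance matches the closed form, and the variance estimator averages to the same value.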


4. Simulation study 


Several simulation results concerning dual frame 
estimators have appeared in literature (Bankier 1986; 
Skinner and Rao 1996; Lohr and Rao 2000). In the general 
case of $Q \ge 2$ frames, Lohr and Rao (2006) extensively 
investigated the empirical mean squared errors of a set of 
eight estimators under optimum, pseudo-optimum and 
single frame approaches, in a three-frame framework under 
a two-stage design. Their results suggest that optimum 
estimators are theoretically optimal but in practice the extra 
variability in estimating optimum weights leads to larger 
mean squared errors. Hence the PML estimator appears as 
the best performer in terms of empirical relative efficiency. 
Furthermore, their study considered a case of about 10% of 
sampled units misclassified into the domains, and they 
recommended more research on the effects of 
misclassification on the performance of the estimators. 

In the present study pseudo-optimum and single frame 
estimators are compared with the multiplicity estimator (4), 
with three main objectives: 


i) to investigate empirical conditions in which the 
multiplicity estimator proves more efficient than 
the SF estimator (Section 4.2); 

ii) to consider the raking ratio correction to known 
frame sizes $N_q$, as already proposed in order to 
improve efficiency of the SF estimator (Section 
4.3); 

iii) to explore the effects of increasing rates of 
misclassification upon the empirical properties of 
the PML and SF estimators (simple and raked) 
versus the natural insensitivity of the multiplicity 
estimator (Section 4.4). 


4.1 Implementation 


The simulation study was performed in an artificial three-frame 
setup and implemented as follows. $N$ population 
pseudo-values $y_i$ are generated from a Gamma distribution. 
Some preliminary simulations indicated that both increasing 
values of the population size $N$ and different values for the 
Gamma parameters (leading to an asymmetrical and almost 
symmetrical shape) do not produce significant differences in 
the pattern of the relative performance of the estimators 
considered. The study was then conducted by setting 
$N = 1{,}200$ and by generating from a Gamma distribution 
with parameters of 1.5 and 2. Every pseudo-value $y_i$ is 
randomly assigned to the $Q = 3$ frames according to 3 
independent Bernoulli trials with probability $\alpha_q = N_q / N$, 
$q = 1, 2, 3$. Different scenarios regarding both frame 
coverage and frame overlapping result from different 
choices for the $\alpha_q$'s under two constraints: a) $\sum_q \alpha_q \ge 1$, in 
order to ensure that the 3 frames cover the entire population, 
and b) the 3 frames are non-empty. In some cases, the 
desired frame overlapping was produced by fixing the ratio 
$N_K / N$ of the population units included in each domain. 
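
The population-generation step can be sketched as follows. Several details are assumptions rather than facts from the paper: the two Gamma parameters are read as shape 1.5 and scale 2, a unit that lands in no frame is redrawn (the text does not say how such units are handled, and redrawing shifts the realized frame proportions slightly away from the nominal probabilities), and the function name and seed are hypothetical.

```python
import random


def simulate_population(N, alphas, shape=1.5, scale=2.0, seed=0):
    """Generate N pseudo-values and assign each to frames by independent Bernoulli trials.

    alphas : inclusion probabilities, one per frame
    Returns a list of (y_i, frames_i) pairs, frames_i being the set of frames containing i.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(N):
        y = rng.gammavariate(shape, scale)
        while True:
            frames = frozenset(q for q, a in enumerate(alphas, start=1)
                               if rng.random() < a)
            if frames:  # assumption: redraw units that fall in no frame
                break
        population.append((y, frames))
    return population


population = simulate_population(1200, alphas=(0.60, 0.60, 0.60))
frame_sizes = {q: sum(q in frames for _, frames in population) for q in (1, 2, 3)}
multiplicities = [len(frames) for _, frames in population]
```

The multiplicity of each unit is simply the size of its frame set, so it is available without any reference to the domain partition.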

Given a set of sampling fractions $f_q = n_q / N_q$, $q = 1, 2, 3$, 
a simple random sample is selected independently 
from every frame, iteratively for 10,000 simulation runs. For 
a given estimator, say $\hat{Y}$, the collection of values 
$\{\hat{Y}_r, r = 1, \ldots, 10{,}000\}$ is taken as its Monte Carlo distribution and 
the empirical mean $E_{mc}(\hat{Y}) = \sum_r \hat{Y}_r / 10{,}000$ and the empirical 
mean squared error $MSE_{mc}(\hat{Y}) = \sum_r (\hat{Y}_r - Y)^2 / 10{,}000$ 
are calculated. The Monte Carlo error is controlled by only 
accepting simulations giving empirical relative bias 
$RB_{mc}(\hat{Y}) = 100 \cdot |E_{mc}(\hat{Y}) - Y| / Y$ less than 1.5% for those 
estimators known to be unbiased. Furthermore, by using the 
exact variance of the multiplicity estimator as given by (5), 
simulations ensure $|MSE_{mc}(\hat{Y}_M) - V(\hat{Y}_M)| < 0.03$. Several 
different scenarios have been investigated by combining 
different levels of frame coverage, of frame overlapping and 
of sampling disproportion, leading to 29 simulated 
populations. In Figure 1 the simulated populations are 
represented as points in the plane formed by the two main 
simulation parameters, namely the (total) frame coverage on 
the horizontal axis (as given by $\sum_q \alpha_q$) and the sampling 
disproportion on the vertical axis, i.e., the dispersion among 
the sampling fractions $f_q$ as measured by 
$\sum_q \sum_{q'} |f_q - f_{q'}| / 3^2$. The different shapes of the 
points in Figure 1 indicate different levels of overlapping, 
namely the total rate of population units classified into the 
four overlapping domains. 
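
The Monte Carlo summaries defined above translate directly into code. The sketch below is illustrative only; the function name and the four replicate values are hypothetical.

```python
def monte_carlo_summary(estimates, Y):
    """Empirical mean, mean squared error and absolute relative bias (in percent)."""
    R = len(estimates)
    e_mc = sum(estimates) / R
    mse_mc = sum((e - Y) ** 2 for e in estimates) / R
    rb_mc = 100.0 * abs(e_mc - Y) / Y
    return e_mc, mse_mc, rb_mc


# Hypothetical replicate estimates of a total whose true value is 10
e_mc, mse_mc, rb_mc = monte_carlo_summary([9.0, 11.0, 10.0, 10.0], Y=10.0)
```

In the study each list would hold the 10,000 replicate values of one estimator computed on the same simulated population.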
Figure 1 Simulated populations 


4.2 Multiplicity versus simple single frame 
estimation 


As noted in Section 3, the multiplicity estimator involves 
specific frame weights whereas the SF estimator is based on 
average coefficients. As a consequence the two estimators 
coincide for constant sampling fraction $f_q = f$ in every 
frame, i.e., for proportionate sampling, and they offer 
different estimates for disproportionate sampling. Simulation 
results provide empirical evidence that the multiplicity 
estimator is more accurate than the simple SF estimator. 
Estimator $\hat{Y}_M$ is shown to be more efficient in all the cases 
explored except in one extreme case in which the three 
frames are almost complete and hence the total overlapping 
is close to 100%. Neglecting this single case, efficiency 
gains of $\hat{Y}_M$ over the SF estimator, as measured by a 
customary empirical efficiency ratio (see Table 1), range 
from 5% to 48%, and are at least 26% in half of the 
simulations. The efficiency of the multiplicity estimator 
relative to the SF estimator increases as the sampling 
disproportion increases (see Table 2), whereas it appears 
essentially independent of the levels of 
frame coverage and overlapping. 
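
The quoted gains follow from the summary statistics in Table 1 if the empirical efficiency ratio is read as the ratio of empirical mean squared errors, multiplicity over SF. That reading is an assumption; the paper only calls the ratio customary.

```latex
\text{gain} = 1 - \frac{MSE_{mc}(\hat{Y}_M)}{MSE_{mc}(\hat{Y}_{SF})}, \qquad
1 - 0.95 = 5\%, \quad 1 - 0.52 = 48\%, \quad 1 - 0.74 = 26\% .
```

The first two values use the maximum and minimum ratios and give the range of gains; the third uses the median, so half of the simulated populations show a gain of at least 26%.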

Table 1 
Empirical efficiency ratio of $\hat{Y}_M$ versus SF estimator: Elementary statistics over 28 simulated populations 

Average   Max    Min    Median   75th quantile 
0.7425    0.95   0.52   0.74     0.89 


Table 2 
Empirical efficiency ratio of $\hat{Y}_M$ versus the SF estimator for increasing levels of sampling disproportion 

Sampling disproportion                               0.11   0.22   0.31   0.40 
Empirical efficiency ratio averaged for different    0.92   0.81   0.68   0.57 
levels of frame coverage/overlapping 
4.3 Raking ratio adjustment 


It has been suggested that the raking ratio adjustment 
using the known frame sizes $N_q$ (Bankier 1986) be used in 
order to improve efficiency of the simple SF estimator. 
Theoretical and empirical results that have already appeared 
in literature confirm that the raking ratio SF estimator 
(SFrak) can be considerably more efficient than the simple 
SF estimator (Skinner 1991; Lohr and Rao 2000, 2006; 
Mecatti 2005). 

In order to adjust the multiplicity estimator via raking 
ratio, knowledge of the domain membership of sampled 
units has to be assumed. By using this additional, though 
redundant, information $\hat{Y}_M$ may be rewritten as 

$$\hat{Y}_M = \sum_K \sum_{q \in K} \sum_{i \in s_q} (|K| f_q)^{-1} \, \delta_i(K) \, y_i \qquad (7)$$

where $|K|$ indicates the number of frames involved in 
domain $D_K$ and it equals unit multiplicity $m_i$ for all 
$i \in D_K$. Setting the initial weights at $h_K^{(0)} = (|K| f_q)^{-1}$, the 
$t$-th iteration of the raking ratio multiplicity estimator (Mrak) 
is obtained by substituting the following raked weights in 


(7) 


i ae ifqekK 
(¢-l) 
hy’ = p> Nixa ke 
hese if qéK 


where $q = Q$ if $t$ is a multiple of $Q$, otherwise $q = t \bmod Q$, 
for $t = 1, 2, \ldots$ until convergence. 

The simulations covered different levels of frame coverage 
combined with different sets of sampling fractions, leading 
to increasing sampling disproportion. 

Empirical results show that Mrak is more efficient than 
SFrak in 38% of cases explored and it is equally or less 
efficient in the remaining cases. Efficiency gains range from 
3% to 74% and occur for low levels of frame coverage. For 
increasing frame coverage (and hence increasing over- 
lapping) Mrak estimator is superior to SFrak estimator for 
high sampling disproportion only. In the other cases, namely 
for increasing frame coverage/overlapping combined with 
low to medium sampling disproportion, Mrak can be 
considerably less efficient than SFrak (see Table 3 for the 
ten indicative cases) and also severely biased. Thus 
empirical results suggest that the raking ratio adjustment has 
better effects under a single frame approach than under a 
multiplicity approach, although there are conditions in 
which the latter is still superior. In this respect more 
research is needed. In particular, since the raking ratio 
procedure is in fact a special case of calibration (Deville and 
Särndal 1992; Deville, Särndal and Sautory 1993), potential 
improvements might follow by applying the more general 
calibration to estimator $\hat{Y}_M$. Calibration of the multiplicity 
estimator, viewed as a particular case of the Generalized 
Weight Share Method, is outlined in Lavallée (2002, 2007). 


Table 3 Efficiency of Mrak versus SFrak: Ten indicative 
simulation runs 

Frame coverage        Sampling fractions    Empirical efficiency ratio 
$\alpha_q = N_q/N$    $f_q = n_q/N_q$       Mrak versus SFrak 
0.60  0.60  0.60      0.01  0.95  0.15      0.26 
OFS au 0.35: O535     0.80  0.20  0.50      0.54 
0.85  0.85  0.85      0.01  0.95  0.15      0.71 
0.35  0.40  0.50      0.70  0.80  0.60      0.96 
0.85  0.85  0.85      0.80  0.20  0.50      1.01 
0.60  0.60  0.60      0.70  0.80  0.60      1.09 
0.80  0.50  0.35      O10 Lut: 95120915 22 
0.35  0.40  0.50      0.01  0.95  0.15      1.63 
Oc 70.005 2.0.95      0.70  0.80  0.60      Ya Ve) 
O700-0) 0.95          0.80  0.20  0.50      oe) 


4.4 Misclassification 


The aim of the final part of the simulation study is to 
investigate the sensitivity of the pseudo-optimum (PML) 
and single frame estimators (simple and raked) to increasing 
levels of misclassification of sampled units into the 
domains, with respect to the structural insensitivity of the 
proposed multiplicity estimator. For a chosen rate of 
misclassification, the desired number of sampled units to be 
inexactly classified is taken from the domain with the 
largest size and randomly assigned to the remaining 
domains, independently for each frame. 
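
The misclassification step can be sketched as below. Two points are assumptions rather than details from the paper: "the domain with the largest size" is taken to mean the domain with the most sampled units in the frame sample at hand, and the list of admissible domains is supplied by the caller. The function name and the domain labels are hypothetical.

```python
import random
from collections import Counter


def misclassify(domains, rate, allowed, rng):
    """Return a copy of the recorded domain labels with some units misclassified.

    domains : recorded domain label of each sampled unit in one frame sample
    rate    : share of sampled units to misclassify
    allowed : domain labels a unit may be reassigned to
    The units to move are drawn from the domain with the most sampled units.
    """
    labels = list(domains)
    largest, _ = Counter(labels).most_common(1)[0]
    candidates = [i for i, d in enumerate(labels) if d == largest]
    k = min(round(rate * len(labels)), len(candidates))
    others = [d for d in allowed if d != largest]
    for i in rng.sample(candidates, k):
        labels[i] = rng.choice(others)
    return labels


# Hypothetical sample of 100 units from frame 1, labelled by domain index set
recorded = ["12"] * 60 + ["1"] * 30 + ["123"] * 10
noisy = misclassify(recorded, rate=0.05, allowed=["1", "12", "13", "123"],
                    rng=random.Random(1))
changed = [i for i, (a, b) in enumerate(zip(recorded, noisy)) if a != b]
```

The multiplicity reported by each unit is left untouched by this step, which is why the multiplicity estimator is unaffected while the domain-based estimators are not.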

Tables 4 and 5 show elementary statistics summarizing 
simulation results in the case of exact classification and in 
the case of slight misclassification equal to 1% of sampled 
units. Note that for exact classification all the estimators 
appear unbiased (or nearly unbiased). As regards efficiency, 
in agreement with other simulation results (Lohr and Rao 2006) 
SFrak and PML estimators show similar performances. As 
expected, for exact classification they are more efficient 
than $\hat{Y}_M$ in all the cases explored (except for two isolated 
cases) as a consequence of the different amount of 
information used in the estimation process. However the SF 
(simple and raked) and PML estimators tend to become 
biased and less efficient than $\hat{Y}_M$ in the presence of just a small 
amount of misclassification. 


Table 4 Relative bias in case of 1% of misclassification: Elementary 
statistics over the 29 simulated populations 

(absolute) $RB_{mc}$, 1% of    Average   Min    Max    Median   75th quantile 
sampled units misclassified 
$\hat{Y}_M$                    0         0      0      0        0 
SF                             2.5880    0.83   7.02   2.65     2.13 
SFrak                          E7632. 0573 4 LT 1.65 
PML                            Disa2.    0.23   4.67   3.46     2.87 




Table 5 
Empirical efficiency ratio of $\hat{Y}_M$ versus SF and PML estimators: 
Elementary statistics over the 29 simulated populations 
for exact classification and for slight misclassification 

Empirical efficiency ratio   Average   Min    Max    Median   75th quantile 
Exact classification 
SFrak                        PAST      0.69   3.21   F531     1.28 
PML                          1.41      0.72   3.30   1.47     | Ree 
1% misclassification 
SF                           0.392 0039071" F054 0.34 
SFrak                        0,783 COTES O95 0.74 
PML                          0.77      0.14   1.94   0.98     0.70 


Finally we focused on the case of maximum efficiency of 
SF, SFrak and PML over $\hat{Y}_M$ for exact classification, 
namely the case of high frame overlapping/coverage and 
low sampling disproportion. In this setup, increasing rates 
of misclassification of sampled units into domains (from 0 
to 50%) were investigated. Tables 6 and 7 show respectively 
the relative bias and the efficiency ratio of $\hat{Y}_M$ versus SF, 
SFrak and PML estimators, for increasing levels of 
misclassification. Although the 
negative effects of misclassification are rapid and severe for 
all the competitors, the PML estimator emerges as the least 
affected. 

As a conclusion the proposed multiplicity estimator, 
besides being simple, is recommended when the risk of 
(even slight) misclassification of sampled units into the 
domains is a concrete possibility. 


Table 6 (absolute) Relative bias for increasing rate 
of misclassification 

% misclassification   $\hat{Y}_M$   SF      SFrak   PML 
0                     0             0       0       0 
1%                    Wiese sy | 1.38 4.3 
5%                    Ore 13:57 Plone DOTS 
10%                   0             17.80   14.14   4.56 
20%                   0 Da 2D 6 
50%                   0             144     68      39 


Table 7 
Empirical efficiency ratio of $\hat{Y}_M$ versus SF, SFrak and 
PML estimators for increasing rate of misclassification 

% misclassification   $\hat{Y}_M$ versus SF   $\hat{Y}_M$ versus SFrak   $\hat{Y}_M$ versus PML 
0                     0.640                   3.210                      3.300 
1%                    0.260                   1.040                      1.100 
5%                    0.020                   0.060                      0.370 
10%                   0.010                   0.020                      0.150 
20%                   0.004                   0.004                      0.080 
50%                   ~ 0                     0.001                      0.006 
Variance estimation for a ratio in the presence of imputed data 


David Haziza ¹ 


Abstract 


In this paper, we study the problem of variance estimation for a ratio of two totals when marginal random hot deck 
imputation has been used to fill in missing data. We consider two approaches to inference. In the first approach, the validity 
of an imputation model is required. In the second approach, the validity of an imputation model is not required but response 
probabilities need to be estimated, in which case the validity of a nonresponse model is required. We derive variance 
estimators under two distinct frameworks: the customary two-phase framework and the reverse framework. 


Key Words: Imputation model; Nonresponse model; Marginal random hot deck imputation; Reverse framework; Two-phase framework; Variance estimation. 


1. Introduction 


Variance estimation in the presence of imputed data for
simple univariate parameters such as population totals and
population means has been widely treated in recent years;
see, for example, Särndal (1992), Deville and Särndal
(1994), Rao and Shao (1992), Rao (1996) and Shao and
Steel (1999). In practice, it is often of interest to estimate the
ratio of two population totals, $R = Y/X$, where
$(Y, X) = \sum_{i \in U} (y_i, x_i)$, $y$ and $x$ denote two variables of
interest potentially missing and $U$ denotes the finite
population (of size $N$) under study. Although variance
estimation for a ratio in the presence of imputed data is a 
problem frequently encountered in practice (especially in 
business surveys), it has not been, to our knowledge, fully 
studied in the literature. In this paper, we consider the case of
Marginal Random Hot Deck (MRHD) imputation
performed within the same set of imputation classes for both 
variables y and x. In other words, to compensate for 
nonresponse, random hot deck imputation is performed 
separately for both variables within the same set of 
imputation classes. This situation occurs frequently in 
practice. For simplicity, we consider the case of a single 
imputation class. Extensions to multiple imputation classes 
are relatively straightforward for most derivations presented 
in this paper. 

In this paper, we derive variance estimators that take 
sampling, nonresponse and imputation into account. Two 
distinct frameworks for variance estimation have been 
studied in the literature: (i) the customary two-phase 
framework (e.g., Särndal (1992)) and (ii) the reverse
framework (e.g., Shao and Steel (1999)). In the two-phase 
framework, nonresponse is viewed as a second phase of 
selection. That is, a random sample is selected from the 
population according to a given sampling design. Then, 
given the selected sample, the set of respondents is 


generated according to the nonresponse mechanism. In the 
reverse framework, the order of sampling and response is 
reversed. That is, the population is first randomly divided 
into a population of respondents and a population of 
nonrespondents according to the nonresponse mechanism. 
Then, a random sample is selected from the population 
(containing respondents and nonrespondents) according to 
the sampling design. As we will see in section 4, the reverse 
framework facilitates the derivation of variance estimators 
but unlike the two-phase framework, it requires the 
additional assumption that the nonresponse mechanism does 
not depend on which sample is selected. This assumption is 
satisfied in many situations encountered in practice. For 
each framework, inference can be based either on an 
Imputation Model (IM) or a Nonresponse Model (NM). The 
IM approach requires the validity of an imputation model, 
whereas the NM approach requires the validity of a 
nonresponse model. 

In section 2, we introduce notation, assumptions and the 
imputed estimator of a ratio under weighted MRHD 
imputation. The IM and NM approaches are then presented 
in sections 2.1 and 2.2. In section 2.3, the bias of the 
imputed estimator is discussed. In section 3, variance 
estimators are derived under the two-phase framework and 
the IM approach using the method proposed by Särndal
(1992). We show that, under MRHD imputation, the naive 
variance estimator (that treats the imputed values as 
observed values) generally overestimates the sampling 
variance when y and x are positively correlated. In section 
4, we derive variance estimators under the reverse 
framework and both the IM and the NM approaches using 
the method proposed by Shao and Steel (1999). Finally, we 
conclude in section 5. 
2. Notation and assumptions


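Our goal is to estimate $R$. We select a random sample, $s$, of size $n$, according to a given sampling design $p(s)$. A complete-data estimator is given by

$$\hat R = \frac{\hat Y_{HT}}{\hat X_{HT}}, \qquad (2.1)$$

where $(\hat Y_{HT}, \hat X_{HT}) = \sum_{i \in s} w_i (y_i, x_i)$ denote the Horvitz-Thompson estimators for $Y$ and $X$, respectively, and $w_i = 1/\pi_i$ denotes the sampling weight of unit $i$, where $\pi_i$ is its probability of inclusion in the sample. The estimator $\hat R$ in (2.1) is asymptotically $p$-unbiased for $R$, i.e., $E_p(\hat R) \approx R$, where the subscript $p$ denotes the expectation and variance with respect to the sampling design $p(s)$. Since $\hat R$ is a nonlinear function of estimated totals, its exact design variance, $V_p(\hat R)$, cannot be easily obtained. To overcome this problem, Taylor linearization is often applied in order to approximate the exact variance. An asymptotically $p$-unbiased estimator of the approximate variance of $\hat R$ is given by

$$\hat V(\hat R) = \sum_{i \in s} \sum_{j \in s} \Delta_{ij}\, e_i e_j, \qquad (2.2)$$

where $e_i = \frac{1}{\hat X_{HT}} (y_i - \hat R x_i)$, $\Delta_{ij} = (\pi_{ij} - \pi_i \pi_j)/(\pi_{ij} \pi_i \pi_j)$ and $\pi_{ij}$ is the joint selection probability of units $i$ and $j$. Note that $\pi_{ii} = \pi_i$. In the case of simple random sampling without replacement, the variance estimator (2.2) reduces to

$$\hat V(\hat R) = \left(1 - \frac{n}{N}\right) \frac{1}{n \bar x^2} \left[ s_y^2 + \hat R^2 s_x^2 - 2 \hat R s_{xy} \right], \qquad (2.3)$$

where

$$s_y^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar y)^2, \qquad s_x^2 = \frac{1}{n-1} \sum_{i \in s} (x_i - \bar x)^2$$

and

$$s_{xy} = \frac{1}{n-1} \sum_{i \in s} (x_i - \bar x)(y_i - \bar y)$$

with

$$(\bar y, \bar x) = \frac{1}{n} \sum_{i \in s} (y_i, x_i).$$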
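As a numerical illustration of (2.2) and (2.3), the following Python sketch computes the ratio estimate and both forms of the variance estimator for a simple random sample without replacement. The function names, the simulated data and the population size are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def ratio_variance_srswor(y, x, N):
    """Ratio estimate and variance estimate (2.3) for an SRSWOR sample from N units."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = y.size
    r_hat = y.sum() / x.sum()                 # equal weights N/n cancel in the ratio
    s_y2, s_x2 = y.var(ddof=1), x.var(ddof=1)
    s_xy = np.cov(x, y, ddof=1)[0, 1]
    v = (1 - n / N) / (n * x.mean() ** 2) * (s_y2 + r_hat**2 * s_x2 - 2 * r_hat * s_xy)
    return r_hat, v

def ratio_variance_general(y, x, pi, pi_joint):
    """Double-sum estimator (2.2); pi_joint[i, i] must equal pi[i]."""
    y, x, pi = (np.asarray(v, float) for v in (y, x, pi))
    w = 1.0 / pi
    x_ht = (w * x).sum()
    r_hat = (w * y).sum() / x_ht
    e = (y - r_hat * x) / x_ht                # linearized residuals e_i
    pp = np.outer(pi, pi)
    delta = (pi_joint - pp) / (pi_joint * pp)
    return float(e @ delta @ e)

# Hypothetical data: n = 10 units drawn from a population of N = 50.
rng = np.random.default_rng(0)
N, n = 50, 10
x = rng.uniform(1.0, 5.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)
pi = np.full(n, n / N)
pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_joint, n / N)

r_hat, v_srs = ratio_variance_srswor(y, x, N)
v_gen = ratio_variance_general(y, x, pi, pi_joint)
print(r_hat, v_srs, v_gen)   # the two variance estimates agree up to rounding
```

The agreement of the two values reflects the algebraic reduction of the double sum under equal inclusion probabilities; for other designs only the general form applies.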


We now turn to the case for which both variables $x$ and $y$ may be missing. Let $a_i$ be the response indicator of unit $i$ such that $a_i = 1$ if unit $i$ responds to variable $y$ and $a_i = 0$, otherwise. Similarly, let $b_i$ be the response indicator of unit $i$ such that $b_i = 1$ if unit $i$ responds to variable $x$ and $b_i = 0$, otherwise. Let $s^{(y)}$ be the set of respondents to variable $y$ of size $r_y$ and $s^{(x)}$ be the set of respondents to variable $x$ of size $r_x$. Also, let $r_{xy}$ be the number of respondents to both variables $y$ and $x$. Finally, let $y_i^*$ and $x_i^*$ denote the imputed values to replace the missing values $y_i$ and $x_i$, respectively. An imputed estimator of $R$ is given by

$$\hat R_I = \frac{\sum_{i \in s} w_i \tilde y_i}{\sum_{i \in s} w_i \tilde x_i}, \qquad (2.4)$$

where $\tilde y_i = a_i y_i + (1 - a_i) y_i^*$ and $\tilde x_i = b_i x_i + (1 - b_i) x_i^*$. Under weighted MRHD imputation, to compensate for the missing value $y_i$, a donor $j$ is selected at random with replacement from $s^{(y)}$ so that

$$P(y_i^* = y_j) = \frac{w_j a_j}{\sum_{l \in s} w_l a_l}.$$

Similarly, to compensate for the missing value $x_i$, a donor $k$ is selected at random with replacement from $s^{(x)}$ so that

$$P(x_i^* = x_k) = \frac{w_k b_k}{\sum_{l \in s} w_l b_l}.$$

Note that, when both $y_i$ and $x_i$ are missing, $j$ is generally not equal to $k$ under weighted MRHD imputation.
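The donor selection described above can be sketched in Python for a single imputation class. This is only an illustration: the function names, the toy data and the use of NaN to mark missing values are assumptions of the example, not part of the paper.

```python
import numpy as np

def weighted_mrhd(values, responded, weights, rng):
    """Fill missing entries with donor values drawn with replacement,
    with probability proportional to the donors' sampling weights."""
    values = np.asarray(values, float)
    responded = np.asarray(responded, bool)
    weights = np.asarray(weights, float)
    donors = np.flatnonzero(responded)
    p = weights[donors] / weights[donors].sum()
    filled = values.copy()
    n_missing = int((~responded).sum())
    filled[~responded] = values[rng.choice(donors, size=n_missing, replace=True, p=p)]
    return filled

def imputed_ratio(y, x, a, b, w, rng):
    """Imputed ratio estimator: donors for y and x are drawn independently."""
    y_tilde = weighted_mrhd(y, a, w, rng)
    x_tilde = weighted_mrhd(x, b, w, rng)
    return (w * y_tilde).sum() / (w * x_tilde).sum(), y_tilde, x_tilde

# Hypothetical sample of five units; NaN marks a missing value.
rng = np.random.default_rng(1)
y = np.array([10.0, np.nan, 14.0, 9.0, np.nan])
x = np.array([5.0, 6.0, np.nan, 4.0, np.nan])
a, b = ~np.isnan(y), ~np.isnan(x)
w = np.array([2.0, 2.0, 3.0, 3.0, 1.0])

r_imp, y_tilde, x_tilde = imputed_ratio(y, x, a, b, w, rng)
print(r_imp)
```

Because the two calls to `weighted_mrhd` draw donors separately, the last unit (missing on both variables) will usually receive its $y$ and $x$ values from different donors.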

Random hot-deck imputation within classes is widely
used in practice because (i) it preserves the variability of the
original data; and (ii) it leads to plausible values. The latter
is particularly important in the case of categorical variables
of interest. However, random hot-deck imputation within
classes suffers from an additional component of variance
due to the use of a random imputation mechanism. The
main reason weighted MRHD imputation is used is that it
leads to an asymptotically unbiased estimator under the
nonresponse model approach (see section 2.1), unlike
unweighted MRHD imputation.

Let $E_I(\cdot \mid s, s^{(y)}, s^{(x)})$, $V_I(\cdot \mid s, s^{(y)}, s^{(x)})$ and $\mathrm{Cov}_I(\cdot, \cdot \mid s, s^{(y)}, s^{(x)})$ denote the conditional expectation, the conditional variance and the conditional covariance operators with respect to the random imputation mechanism (here, weighted MRHD imputation). Using a first-order Taylor expansion, it can be shown that

$$E_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) \approx \frac{\bar y_r}{\bar x_r}, \qquad (2.5)$$

where

$$\bar y_r = \frac{\sum_{i \in s} w_i a_i y_i}{\sum_{i \in s} w_i a_i}$$

and

$$\bar x_r = \frac{\sum_{i \in s} w_i b_i x_i}{\sum_{i \in s} w_i b_i}$$

denote the weighted means of the respondents to variables $y$ and $x$, respectively. The approximation in (2.5) will be valid if the sample size within classes is sufficiently large, which we assume to be the case. Now, let

$$s_{yr}^2 = \frac{1}{\sum_{i \in s} w_i a_i} \sum_{i \in s} w_i a_i (y_i - \bar y_r)^2$$

and

$$s_{xr}^2 = \frac{1}{\sum_{i \in s} w_i b_i} \sum_{i \in s} w_i b_i (x_i - \bar x_r)^2$$

denote the variability of the $y$-values and the $x$-values in the sets of respondents $s^{(y)}$ and $s^{(x)}$, respectively. Noting that, under weighted MRHD imputation, $V_I(y_i^*) = s_{yr}^2$, $V_I(x_i^*) = s_{xr}^2$ and $\mathrm{Cov}_I(y_i^*, x_i^*) = 0$, we can approximate $V_I(\hat R_I \mid s, s^{(y)}, s^{(x)})$ by

$$V_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) \approx \frac{1}{(\hat N \bar x_r)^2} \left[ \sum_{i \in s} w_i^2 (1 - a_i)\, s_{yr}^2 + \left( \frac{\bar y_r}{\bar x_r} \right)^2 \sum_{i \in s} w_i^2 (1 - b_i)\, s_{xr}^2 \right], \qquad (2.6)$$

where $\hat N = \sum_{i \in s} w_i$. Expressions (2.5) and (2.6) will be useful in subsequent sections when discussing the bias and the variance of the imputed estimator $\hat R_I$. As we will see in sections 3 and 4, the conditional variance (2.6) is a measure of the variability due to the imputation mechanism.
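The first-order approximations of the conditional mean and variance of the imputed estimator can be checked by simulation. The sketch below repeats weighted MRHD imputation many times on one fixed sample and compares the Monte Carlo mean and variance with $\bar y_r / \bar x_r$ and with a linearized variance of the form of (2.6). The simulated data, the response rates and the choice of $\hat N \bar x_r$ as the linearization point are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
w = rng.uniform(1.0, 3.0, n)              # hypothetical sampling weights
x = rng.uniform(5.0, 10.0, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)
a = rng.random(n) < 0.7                   # response indicators for y
b = rng.random(n) < 0.8                   # response indicators for x

def impute(values, resp):
    """One weighted random hot deck draw for the nonrespondents."""
    donors = np.flatnonzero(resp)
    p = w[donors] / w[donors].sum()
    out = values.copy()
    out[~resp] = values[rng.choice(donors, size=int((~resp).sum()), p=p)]
    return out

# Weighted respondent means and variabilities.
y_bar_r = (w * a * y).sum() / (w * a).sum()
x_bar_r = (w * b * x).sum() / (w * b).sum()
s2_yr = (w * a * (y - y_bar_r) ** 2).sum() / (w * a).sum()
s2_xr = (w * b * (x - x_bar_r) ** 2).sum() / (w * b).sum()

mean_approx = y_bar_r / x_bar_r
var_approx = ((w**2 * (1 - a)).sum() * s2_yr
              + mean_approx**2 * (w**2 * (1 - b)).sum() * s2_xr) / (w.sum() * x_bar_r) ** 2

# Monte Carlo over the imputation mechanism only (sample and respondents fixed).
reps = np.array([(w * impute(y, a)).sum() / (w * impute(x, b)).sum()
                 for _ in range(2000)])
print(reps.mean(), mean_approx)
print(reps.var(ddof=1), var_approx)
```

With a class of this size, the simulated and approximate values should be close; the agreement degrades when the number of respondents in the class is small.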

Next, we describe two approaches to inference that will 
be used to obtain variance estimators in sections 3 and 4: the 
Nonresponse Model (NM) approach and the Imputation 
Model (IM) approach. 


2.1 The nonresponse model approach 


In the NM approach, inference is made with respect to the joint distribution induced by the sampling design and the nonresponse model. The nonresponse model is a set of assumptions about the unknown distribution of the response indicators $\mathbf{R}_s = \{(a_i, b_i);\ i \in s\}$. This unknown distribution is often called the nonresponse mechanism. Let $p_{yi} = P(a_i = 1 \mid s, \mathbf{Z}_s)$ be the response probability of unit $i$ to variable $y$, where $\mathbf{Z}_s = \{\mathbf{z}_i;\ i \in s\}$ and $\mathbf{z}_i$ is a vector of auxiliary variables available for all sample units used to form the imputation classes. Similarly, let $p_{xi} = P(b_i = 1 \mid s, \mathbf{Z}_s)$ be the response probability of unit $i$ to variable $x$. We assume that units respond independently; i.e., $p_{yij} = P(a_i = 1, a_j = 1 \mid s, \mathbf{Z}_s) = p_{yi}\, p_{yj}$ for $i \neq j$ and $p_{xij} = P(b_i = 1, b_j = 1 \mid s, \mathbf{Z}_s) = p_{xi}\, p_{xj}$ for $i \neq j$. However, we do not assume that, for a given unit $i$, response to variable $y$ is independent of response to variable $x$. In other words, if we let $p_{xyi} = P(a_i = 1, b_i = 1 \mid s, \mathbf{Z}_s)$, then we have $p_{xyi} \neq p_{xi}\, p_{yi}$ in general. Within an imputation class, we assume a uniform response mechanism such that $p_{yi} = p_y$, $p_{xi} = p_x$ and $p_{xyi} = p_{xy}$.

We also assume that, after conditioning on $s$ and $\mathbf{Z}_s$, the nonresponse mechanism is independent of all other variables involved in the imputed estimator (2.4) as well as the joint selection probabilities. In other words, the distribution of $\mathbf{R}_s$ does not depend on $\mathbf{Y}_s = \{y_i;\ i \in s\}$, $\mathbf{W}_s = \{w_i;\ i \in s\}$ and $\mathbf{\Pi}_s = \{\pi_{ij};\ i \in s,\ j \in s\}$, after conditioning on $s$ and $\mathbf{Z}_s$. As a result, except for the response indicators $a_i$ and $b_i$, we assume that all the variables involved in the imputed estimator (2.4) as well as the joint selection probabilities are treated as fixed when taking expectations and variances with respect to the nonresponse model. From this point on, we use the subscript $q$ to denote the expectation and variance with respect to the nonresponse mechanism.


2.2 The imputation model approach 


In the IM approach, inference is made with respect to the joint distribution induced by the imputation model, the sampling design and the nonresponse model. The imputation model is a set of assumptions about the unknown distribution of $(\mathbf{Y}_U, \mathbf{X}_U) = \{(y_i, x_i);\ i \in U\}$. Within an imputation class, the imputation model, $m$, in the case of MRHD imputation, is given by

$$m: \begin{cases} y_i = \mu_y + \varepsilon_i, \\ x_i = \mu_x + \eta_i, \end{cases} \qquad (2.7)$$

where $\varepsilon_i$ is a random error term such that $E_m(\varepsilon_i) = 0$, $E_m(\varepsilon_i \varepsilon_j) = 0$ for $i \neq j$, $V_m(\varepsilon_i) = \sigma_y^2$, and $\eta_i$ is a random error term such that $E_m(\eta_i) = 0$, $E_m(\eta_i \eta_j) = 0$ for $i \neq j$, $V_m(\eta_i) = \sigma_x^2$. Furthermore, we assume that $E_m(\varepsilon_i \eta_i) = \sigma_{xy}$. Here, $E_m(\cdot)$, $V_m(\cdot)$ and $\mathrm{Cov}_m(\cdot)$ denote respectively the expectation, the variance and the covariance operators with respect to model $m$. It is implicit in the notation that expectations or variances with respect to model $m$ are conditional on $\mathbf{Z}_U = \{\mathbf{z}_i;\ i \in U\}$. In this approach, we assume that the distribution of the model errors $(\boldsymbol{\varepsilon}_U, \boldsymbol{\eta}_U) = \{(\varepsilon_i, \eta_i);\ i \in U\}$ does not depend on $s$, $s^{(y)}$, $s^{(x)}$, $\mathbf{W}_U = \{w_i;\ i \in U\}$ and $\mathbf{\Pi}_U = \{\pi_{ij};\ i \in U,\ j \in U\}$, after conditioning on $\mathbf{Z}_U$. As a result, except for the variables of interest $y$ and $x$, all variables involved in the imputed estimator (2.4) are treated as fixed when taking expectations and variances with respect to the imputation model.
2.3 Bias of the imputed estimator


To study the bias of the imputed estimator (2.4), we use the standard decomposition of the total error of $\hat R_I$:

$$\hat R_I - R = [\hat R - R] + [E_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) - \hat R] + [\hat R_I - E_I(\hat R_I \mid s, s^{(y)}, s^{(x)})]. \qquad (2.8)$$

The first term $\hat R - R$ on the right-hand side of (2.8) is called the sampling error, the second term $E_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) - \hat R$ is called the nonresponse error, whereas the third term $\hat R_I - E_I(\hat R_I \mid s, s^{(y)}, s^{(x)})$ is called the imputation error.

Using a first-order Taylor expansion, it can easily be shown that, under the NM approach, the imputed estimator (2.4) is asymptotically $pqI$-unbiased; that is, $E_{pqI}(\hat R_I - R) \approx 0$. Also, under the IM approach and model (2.7), it can be shown that the imputed estimator (2.4) is asymptotically $mpqI$-unbiased; that is, $E_{mpqI}(\hat R_I - R) \approx 0$. Thus, the imputed estimator is robust in the sense that it is valid under either the NM approach or the IM approach. Note that for the asymptotic bias to be equal to 0 under both approaches, we require that the sample size within each imputation class be sufficiently large. From this point on, we thus assume that the bias of $\hat R_I$ is negligible.
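The three error components of decomposition (2.8) can be computed directly in a small simulation, which makes the telescoping structure explicit. The sketch below assumes a hypothetical population, simple random sampling (equal weights), uniform response, and uses $\bar y_r / \bar x_r$ as the first-order value of the conditional expectation under imputation; all of these are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 1000, 100
x_pop = rng.uniform(5.0, 10.0, N)
y_pop = 2.0 * x_pop + rng.normal(0.0, 1.0, N)
R = y_pop.sum() / x_pop.sum()              # population ratio

s = rng.choice(N, size=n, replace=False)   # SRSWOR, so all weights are equal
y, x = y_pop[s], x_pop[s]
a = rng.random(n) < 0.7                    # respondents to y
b = rng.random(n) < 0.8                    # respondents to x

R_hat = y.sum() / x.sum()                  # complete-data estimator
R_cond = y[a].mean() / x[b].mean()         # first-order conditional expectation

# One realization of (unweighted, since weights are equal) random hot deck.
y_t, x_t = y.copy(), x.copy()
y_t[~a] = rng.choice(y[a], size=int((~a).sum()))
x_t[~b] = rng.choice(x[b], size=int((~b).sum()))
R_imp = y_t.sum() / x_t.sum()

sampling_error = R_hat - R
nonresponse_error = R_cond - R_hat
imputation_error = R_imp - R_cond
print(sampling_error, nonresponse_error, imputation_error)
```

The three terms add up to the total error of the imputed estimator by construction; their relative sizes in a single run are only indicative.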


3. Variance estimation: The two-phase framework


In this section, we derive variance estimators under the two-phase framework and the IM approach according to the method proposed by Särndal (1992) and Deville and Särndal (1994). Using the decomposition (2.8), the total variance of $\hat R_I$ can be approximated by

$$V_{mpqI}(\hat R_I - R) = E_{mpqI}(\hat R_I - R)^2 \approx V_{SAM} + V_{NR} + V_I + 2 V_{MIX}, \qquad (3.1)$$

where $V_{SAM} = E_m V_p(\hat R) = E_m(V_{sam})$ is the sampling variance of the complete-data estimator $\hat R$, $V_{NR} = E_{pq} V_m(E_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) - \hat R \mid s, s^{(y)}, s^{(x)})$ is the nonresponse variance of the imputed estimator $\hat R_I$, $V_I = E_{mpq} V_I(\hat R_I - \hat R \mid s, s^{(y)}, s^{(x)})$ is the imputation variance of the imputed estimator $\hat R_I$, and $V_{MIX} = E_{pq} E_m[(E_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) - \hat R)(\hat R - R) \mid s, s^{(y)}, s^{(x)}]$ is a mixed component. Note that the expression (3.1) contains only one cross product term, $2 V_{MIX}$, because the other cross product terms are all asymptotically equal to 0.
3.1 Estimation of the sampling variance $V_{SAM}$


Let $\hat V_{ORD}$ be the naive variance estimator of $\hat R_I$, i.e., the variance estimator obtained by treating the imputed values as observed values. The variance estimator $\hat V_{ORD}$ is thus obtained by replacing $e_i$ by $\tilde e_i = \frac{1}{\hat X_I} (\tilde y_i - \hat R_I \tilde x_i)$ in (2.2), where $\hat X_I = \sum_{i \in s} w_i \tilde x_i$, which leads to

$$\hat V_{ORD} = \sum_{i \in s} \sum_{j \in s} \Delta_{ij}\, \tilde e_i \tilde e_j. \qquad (3.2)$$

As we now show in the case of simple random sampling without replacement, $\hat V_{ORD}$ overestimates $V_{SAM}$ under MRHD imputation whenever $\sigma_{xy} > 0$ (as is usually the case in practice). After some algebra, we obtain


En ean a ae) 


Vsam | 8, ee 


-a(e)iea- a) 
a ! 
reg is N A) n 
Expression (3.3) shows that $\hat V_{ORD}$ is $mpqI$-biased for
$V_{SAM}$ unless $\sigma_{xy} = 0$, $r_{xy} = n$ (which is the case of
complete data) or $n = N$ (which is the census case). The
fact that $\hat V_{ORD}$ is not a valid estimator of $V_{SAM}$ can be easily
explained by noting that although MRHD imputation
preserves the variability, $s_x^2$ and $s_y^2$, corresponding to
variables $x$ and $y$, it does not preserve the covariance, $s_{xy}$,
in (2.3). Indeed, imputation tends to underestimate
relationships between variables that are positively
correlated. As a result, $\hat V_{ORD}$ overestimates $V_{SAM}$ because
of the presence of the minus sign in front of $s_{xy}$ in (2.3). To
overcome this difficulty, Särndal (1992) proposed to estimate $V_{DIF} = E_{mI}(V_{sam} - \hat V_{ORD} \mid s, s^{(y)}, s^{(x)})$ by an $mI$-unbiased estimator $\hat V_{DIF}$; i.e., $E_{mI}(\hat V_{DIF} \mid s, s^{(y)}, s^{(x)}) = V_{DIF}$. However, the derivation of this component for an arbitrary design involves very tedious algebra in the case of a ratio. Therefore, we propose an alternative that does not require any derivation but involves the construction of a new set of imputed values. It can be described as follows:
whenever $a_i = 0$ and/or $b_i = 0$, select a donor $j$ at random with replacement from the set of respondents to both variables $y$ and $x$ (i.e., the set of sampled units for which $a_j = 1$ and $b_j = 1$) with probability $w_j / \sum_{l \in s} w_l a_l b_l$ and impute the vector $(x_j, y_j)$. In other words, whenever one variable is missing, the observed value is discarded and set to missing; the missing values are then replaced by the values of a donor selected at random among the set of respondents to both variables $x$ and $y$ (often called the set of common donors). Similarly, when both variables are missing, the vector $(x_j, y_j)$ of a donor $j$ is imputed. Then, use the standard variance estimator (2.2) valid in the complete response case using these imputed values. Let $\hat V_{ORD}^{*}$ denote the resulting variance estimator. Note that this new set of imputed values is used only to obtain a valid estimator of the sampling variance and is not used to estimate the parameter of interest $R$. It can be shown that $\hat V_{ORD}^{*}$ is an asymptotically $mpqI$-unbiased estimator of $V_{SAM}$. In practice, one could, for example, create a variance estimation file containing the new set of imputed values and use standard variance estimation systems (used in the complete data case) to obtain an estimate of the sampling variance.
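The construction of such a variance estimation file can be sketched as follows for a simple random sample without replacement. The function names, the simulated data and the use of NaN for missing values are assumptions of this example; the variance function applies the SRSWOR form (2.3) of the complete-data estimator.

```python
import numpy as np

def common_donor_values(y, x, a, b, w, rng):
    """Replace the (x, y) pair of every unit with at least one missing value
    by the pair of a common donor, drawn with probability proportional to w."""
    y, x, w = (np.asarray(v, float) for v in (y, x, w))
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    common = np.flatnonzero(a & b)            # respondents to both variables
    needs = ~(a & b)                          # units with y and/or x missing
    p = w[common] / w[common].sum()
    donors = rng.choice(common, size=int(needs.sum()), replace=True, p=p)
    y_v, x_v = y.copy(), x.copy()
    y_v[needs], x_v[needs] = y[donors], x[donors]
    return y_v, x_v, donors, needs

def variance_srswor(y, x, N):
    """Complete-data variance estimator (2.3) applied to the supplied values."""
    n = y.size
    r = y.sum() / x.sum()
    s_xy = np.cov(x, y, ddof=1)[0, 1]
    return (1 - n / N) / (n * x.mean() ** 2) * (
        y.var(ddof=1) + r**2 * x.var(ddof=1) - 2 * r * s_xy)

# Hypothetical SRSWOR sample with item nonresponse on both variables.
rng = np.random.default_rng(4)
N, n = 2000, 150
x_full = rng.uniform(5.0, 10.0, n)
y_full = 2.0 * x_full + rng.normal(0.0, 1.0, n)
a = rng.random(n) < 0.7
b = rng.random(n) < 0.8
y = np.where(a, y_full, np.nan)
x = np.where(b, x_full, np.nan)
w = np.full(n, N / n)

y_v, x_v, donors, needs = common_donor_values(y, x, a, b, w, rng)
v_sampling = variance_srswor(y_v, x_v, N)     # estimate of the sampling variance only
print(v_sampling)
```

The values in `y_v` and `x_v` serve only the variance calculation; the point estimate of the ratio would still be computed from the original MRHD-imputed file.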


3.2 Estimation of the nonresponse variance $V_{NR}$


An estimator $\hat V_{NR}$ of $V_{NR} = E_{pq} V_m(E_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) - \hat R \mid s, s^{(y)}, s^{(x)})$ can simply be obtained by estimating $V_m(E_I(\hat R_I \mid s, s^{(y)}, s^{(x)}) - \hat R \mid s, s^{(y)}, s^{(x)})$. Using a first-order Taylor expansion, we obtain

(3.4) [equation illegible in the source]

where $(\hat N, \hat N_a, \hat N_b) = \sum_{i \in s} w_i (1, a_i, b_i)$. Now, let $\tilde s_x^2 = \frac{1}{\hat N} \sum_{i \in s} w_i (\tilde x_i - \bar x_I)^2$ and $\tilde s_y^2 = \frac{1}{\hat N} \sum_{i \in s} w_i (\tilde y_i - \bar y_I)^2$ with $(\bar x_I, \bar y_I) = \frac{1}{\hat N} \sum_{i \in s} w_i (\tilde x_i, \tilde y_i)$. Note that $\tilde s_x^2$ and $\tilde s_y^2$ denote respectively the sample variability of the $x$-values and the $y$-values after imputation. It can be shown that $\tilde s_x^2$ and $\tilde s_y^2$ are asymptotically $mI$-unbiased for the model variances $\sigma_x^2$ and $\sigma_y^2$. Also, let $s_{xyr} = \frac{1}{\hat N_{ab}} \sum_{i \in s} w_i a_i b_i (x_i - \bar x_{ab})(y_i - \bar y_{ab})$, where $\hat N_{ab} = \sum_{i \in s} w_i a_i b_i$ and $(\bar x_{ab}, \bar y_{ab}) = \frac{1}{\hat N_{ab}} \sum_{i \in s} w_i a_i b_i (x_i, y_i)$. Note that $s_{xyr}$ is asymptotically $m$-unbiased for the model covariance $\sigma_{xy}$. It follows that $\hat V_{NR}$ is obtained by estimating the unknown quantities in (3.4), which leads to

(3.5) [equation illegible in the source]

The estimator (3.5) is asymptotically $mpqI$-unbiased for $V_{NR}$. In the special case of simple random sampling without replacement, expression (3.5) reduces to

[equation illegible in the source]


3.3 Estimation of the imputation variance $V_I$


An estimator $\hat V_I$ of $V_I = E_{mpq} V_I(\hat R_I - \hat R \mid s, s^{(y)}, s^{(x)})$ can simply be obtained by estimating $V_I(\hat R_I - \hat R \mid s, s^{(y)}, s^{(x)})$ given by (2.6). An asymptotically unbiased estimator of $V_I(\hat R_I - \hat R \mid s, s^{(y)}, s^{(x)})$ is then given by

(3.6) [equation illegible in the source]

It follows that $\hat V_I$ in (3.6) is asymptotically $mpqI$-unbiased for $V_I$. In the case of simple random sampling without replacement, expression (3.6) reduces to

[equation illegible in the source]


3.4 Estimation of the mixed component $V_{MIX}$


Finally, we obtain an estimator $\hat V_{MIX}$ of $V_{MIX}$ by estimating
E (eB e(RE is es) SAR POR -R) |p ss OF


m 


Using a first-order Taylor expansion, we obtain 


E VCE; QR, |s, 8s) se 


~ 1 ies _ = ies rey 
m NN NO oly * 


Swe Dw], vy: 
ies ies ta 2 
+e on 
NW oy ie Abe 


Ds we q; pa w; b, > W; 
9) ies ies 2 ies H,, 
we aliygaticallie NR cintoaneee Merial atl 7) 
a b x 
An estimator of (3.7) is thus given by 
Vx = 
~wa Dw wb Dw 
l ies_ x ies s° ies” . SY ies R s° 
x; || NN, Nols? oa] uae Nae | aad 
2 we a; 3S we b, » w 
9) Hee ee aN ia B. “Ke R, syn ; (3.8) 
a b 


The estimator (3.8) is asymptotically $mpqI$-unbiased for
$V_{MIX}$. In the case of simple random sampling without
replacement, the component $\hat V_{MIX}$ is equal to zero. More
generally, the component $\hat V_{MIX}$ is equal to zero for any
unistage self-weighting design (i.e., a sampling design for
which the sampling weights are all equal). For unequal
probability designs, it is important to include the component
$\hat V_{MIX}$ because its contribution (positive or negative) to the
overall variance could be substantial (Brick, Kalton and
Kim (2004)).

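Finally, an asymptotically $mpqI$-unbiased estimator of the total variance $V_{TOT} = V_{mpqI}(\hat R_I - R)$ is thus given by

$$\hat V_{TOT} = \hat V_{SAM} + \hat V_{NR} + \hat V_I + 2 \hat V_{MIX},$$

where $\hat V_{SAM}$ denotes the estimator of the sampling variance obtained in section 3.1.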


4. Variance estimation: The reverse framework 


In this section, we derive variance estimators under the reverse framework and both the NM and the IM approaches according to the method proposed by Shao and Steel (1999). Recall that, under this framework, we require the additional assumption that the response probabilities do not depend on the sample $s$. Under the NM approach, the total variance of $\hat R_I$ can be approximated by

$$V_{pqI}(\hat R_I - R) \approx E_q V_p E_I(\hat R_I - R \mid \mathbf{a}, \mathbf{b}) + E_{pq} V_I(\hat R_I - R \mid \mathbf{a}, \mathbf{b}) + V_q E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b}), \qquad (4.1)$$

where $\mathbf{a} = (a_1, \ldots, a_N)'$ and $\mathbf{b} = (b_1, \ldots, b_N)'$ denote the vectors of response indicators to variables $y$ and $x$, respectively.

Under the IM approach, the total variance of $\hat R_I$ can be approximated by

$$V_{mpqI}(\hat R_I - R) \approx E_{mq} V_p E_I(\hat R_I - R \mid \mathbf{a}, \mathbf{b}) + E_{mpq} V_I(\hat R_I - R \mid \mathbf{a}, \mathbf{b}) + E_q V_m E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b}). \qquad (4.2)$$

Under both the NM and the IM approaches, an estimator of the first term on the right-hand side of (4.1) and (4.2) can be obtained by finding an asymptotically $pI$-unbiased estimator of $V_p E_I(\hat R_I \mid \mathbf{a}, \mathbf{b})$. Also, the second term on the right-hand side of (4.1) and (4.2) can be estimated by $\hat V_I$ given by (3.6). Under the NM approach, an estimator of the last term on the right-hand side of (4.1) can be obtained by estimating $V_q E_{pI}(\hat R_I \mid \mathbf{a}, \mathbf{b})$, whereas an estimator of the last term on the right-hand side of (4.2) can be obtained by estimating $V_m E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$ under the IM approach. As a result, the estimators of the first two terms in (4.1) and (4.2) are identical and thus are valid regardless of the approach (NM or IM) used for inference. Only the third term on the right-hand side of (4.1) and (4.2) depends on the approach used. In the case of the IM approach, specification and validation of the imputation model is crucial to achieve asymptotic unbiasedness of the third component, whereas in the case of the NM approach, the asymptotic unbiasedness of the third component relies on the correct specification of the nonresponse model.


4.1 Estimation of $V_p E_I(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$


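Using a first-order Taylor expansion and expression (2.5), an estimator of $V_p E_I(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$, denoted by $\hat V_1$, is given by

$$\hat V_1 = \sum_{i \in s} \sum_{j \in s} \Delta_{ij}\, \xi_i \xi_j, \qquad (4.3)$$

where

$$\xi_i = \frac{1}{\bar x_r} \left[ \frac{a_i}{\hat N_a} (y_i - \bar y_r) - \frac{\bar y_r}{\bar x_r} \frac{b_i}{\hat N_b} (x_i - \bar x_r) \right],$$

with $(\hat N_a, \hat N_b) = \sum_{i \in s} w_i (a_i, b_i)$. In other words, the estimator $\hat V_1$ is obtained from the complete data variance estimator (2.2) by replacing $e_i$ by $\xi_i$. In the case of simple random sampling without replacement, the estimator (4.3) reduces to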


2 2) 
Vy, = (1- 2)4) 4 ee 28 | | |, Ga) 
NX ie ie ea lece Ke 


4.2 Estimation of $V_q E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$ under the NM approach


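First, note that

$$E_I(\hat R_I \mid \mathbf{a}, \mathbf{b}) \approx \frac{\hat Y_a}{\hat N_a} \, \frac{\hat N_b}{\hat X_b},$$

where $(\hat Y_a, \hat N_a) = \sum_{i \in s} w_i a_i (y_i, 1)$ and $(\hat X_b, \hat N_b) = \sum_{i \in s} w_i b_i (x_i, 1)$.

Using a first-order Taylor expansion, it can be shown that $E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$ can be approximated by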

z f= = 
1 Py lee (; Ps) 2 
E_,(R,—-R\a,b)»—— Sah 
sdcmanesel2 4(2 5 
Pay all 
— DR) ge Sle 
[ PxPy Js. | ig 
where 
ETA | ma yas 
Sy =H p 2, 0 Fy 
ies ba Oy eS a, 
Pale 
and 
8, = 4d &- 00-7) 
icU 
with 


oS eee Vial 


An estimator of $V_q E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$ is obtained by
estimating unknown quantities in (4.5), which leads to


, ew, a(1— p 
poms tLe Bay sh + Bi aa: 
A


ae 5 
= eae reals fa, — Na 


where Pp = , 
Ey N N N 


The estimator (4.6) is asymptotically $pqI$-unbiased for the
approximate variance (4.5), noting that $s_{yr}^2$, $s_{xr}^2$ and $s_{xyr}$ are
asymptotically $pqI$-unbiased for $S_y^2$, $S_x^2$ and $S_{xy}$, respectively.
In the case of simple random sampling without
replacement, the estimator (4.6) reduces to


2 2 
ONO ML, Wheritie s)he a2 a Tels 
evn ls 3) aah ik a r. 
MD Al ett tel all. DAP De 

aft e cosas Vy } aD 


4.3 Estimation of $E_q V_m E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$ under the IM approach


Using a first-order Taylor expansion, it can be shown that
$E_q V_m E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$ can be approximated by


1 1 IT Aver: 
EV ,E(R, - R | a,b)~ -1|,? 
ab mE pi(B; — R | a,b) iene +) 


+(e) (eto HE (eats) Hn} 


where $N_{ab} = \sum_{i \in U} a_i b_i$. An estimator of $E_q V_m E_{pI}(\hat R_I - R \mid \mathbf{a}, \mathbf{b})$ is obtained by estimating unknown quantities in
(4.8), which leads to


PD eA oe eS ler |S 
2 Re N, N ly vi N, N ik 


MN i 
—2R,| ~*% -=|s,, |. (4.9 
‘Gt | | 


The estimator (4.9) is asymptotically $mpqI$-unbiased for the approximate variance (4.8). It is interesting to note that, under weighted MRHD imputation, the estimator $\hat V_2^{NM}$ in (4.6) obtained under the NM approach is identical to $\hat V_2^{IM}$ in (4.9) obtained under the IM approach. However, this may not be the case with a different imputation method. Also, the component $\hat V_2$ is negligible with respect to $\hat V_1$ when the sampling fraction $n/N$ is negligible, where $\hat V_2$ stands for $\hat V_2^{NM}$ or $\hat V_2^{IM}$. In this case, the component $\hat V_2$ may be omitted from the calculations.

Finally, an estimator of the total variance under the reverse framework is given by

$$\hat V_{TOT}^{(R)} = \hat V_1 + \hat V_I + \hat V_2.$$

Under the reverse framework, both the NM approach and the IM approach lead to the same estimator of the total variance. Thus, the variance estimator $\hat V_{TOT}^{(R)}$ is robust in the sense that it is valid under either the NM approach or the IM approach.


5. Summary and conclusions 


In this paper, we have derived variance estimators for the 
imputed estimator of a ratio under two different frame- 
works. The reverse framework facilitates the derivation of 
the variance expressions (in comparison with the customary 
two-phase framework), especially if the sampling fraction is
small, in which case we can omit the component $\hat V_2$.
However, unlike the two-phase framework, it requires an 
additional assumption that the response probabilities do not 
depend on the realized sample s. Also, the two-phase 
framework uses a natural decomposition of the total error 
that leads to a natural decomposition of the total variance. 
That is, the total variance can be expressed as the sum of the 
sampling variance, the nonresponse variance and the 
imputation variance, which allows the survey statistician to 
get an idea of the relative magnitude of each component. 
Under the reverse approach, there is no easy interpretation 
for the variance components (except the imputation 
variance). 

We have considered the case of weighted MRHD imputation within classes. Another version of weighted random hot-deck imputation, which we call weighted joint random hot deck (JRHD) imputation, is identical to weighted MRHD imputation, except that when both variables are missing, a donor $j$ is selected at random from the set of common donors (i.e., the set of respondents to both variables $y$ and $x$) with probability $w_j / \sum_{l \in s} w_l a_l b_l$ and the vector $(x_j, y_j)$ is imputed. This version of the method helps preserve relationships between survey variables, in contrast to imputing each variable independently. The results for JRHD imputation can be obtained using techniques similar to those presented in this paper. Finally, the results presented in this paper can be easily extended to the case of both deterministic and random regression imputation performed within imputation classes.
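The difference between weighted JRHD and weighted MRHD imputation can be sketched in Python as follows. The function name and the toy data are assumptions of this example: units missing both variables receive the pair of one common donor, while units missing a single variable are treated as under MRHD.

```python
import numpy as np

def weighted_jrhd(y, x, a, b, w, rng):
    """Weighted joint random hot deck within one imputation class."""
    y, x, w = (np.asarray(v, float) for v in (y, x, w))
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    y_t, x_t = y.copy(), x.copy()

    def draw(pool, size):
        # Donors drawn with replacement, probability proportional to weight.
        return rng.choice(pool, size=int(size), replace=True, p=w[pool] / w[pool].sum())

    both = ~a & ~b
    d = draw(np.flatnonzero(a & b), both.sum())      # one common donor per unit
    y_t[both], x_t[both] = y[d], x[d]

    only_y = ~a & b                                   # y missing, x observed
    y_t[only_y] = y[draw(np.flatnonzero(a), only_y.sum())]
    only_x = a & ~b                                   # x missing, y observed
    x_t[only_x] = x[draw(np.flatnonzero(b), only_x.sum())]
    return y_t, x_t

# Hypothetical class of six units; NaN marks a missing value.
rng = np.random.default_rng(5)
y = np.array([10.0, 12.0, np.nan, 9.0, np.nan, np.nan])
x = np.array([5.0, 6.0, 4.5, np.nan, np.nan, np.nan])
a, b = ~np.isnan(y), ~np.isnan(x)
w = np.array([2.0, 3.0, 2.0, 1.0, 1.0, 2.0])

y_t, x_t = weighted_jrhd(y, x, a, b, w, rng)
print(np.column_stack([y_t, x_t]))
```

For the last two units, the imputed $(y, x)$ pair always coincides with the pair of one of the first two units, which is what preserves the relationship between the variables.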
Efficient bootstrap for business surveys


James Chipperfield and John Preston


Abstract 


The Australian Bureau of Statistics has recently developed a generalized estimation system for processing its large scale 
annual and sub-annual business surveys. Designs for these surveys have a large number of strata, use Simple Random 
Sampling within Strata, have non-negligible sampling fractions, are overlapping in consecutive periods, and are subject to 
frame changes. A significant challenge was to choose a variance estimation method that would best meet the following 
requirements: valid for a wide range of estimators (e.g., ratio and generalized regression), requires limited computation time, 
can be easily adapted to different designs and estimators, and has good theoretical properties measured in terms of bias and 
variance. This paper describes the Without Replacement Scaled Bootstrap (WOSB) that was implemented at the ABS and 
shows that it is appreciably more efficient than the Rao and Wu (1988)’s With Replacement Scaled Bootstrap (WSB). The 
main advantages of the Bootstrap over alternative replicate variance estimators are its efficiency (i.e., accuracy per unit of 
storage space) and the relative simplicity with which it can be specified in a system. This paper describes the WOSB 
variance estimator for point-in-time and movement estimates that can be expressed as a function of finite population means. 
Simulation results obtained as part of the evaluation process show that the WOSB was more efficient than the WSB, 
especially when the stratum sample sizes are sometimes as small as 5. 


Key Words: Variance; Bootstrap; Stratified sampling. 


1. Introduction 


In 2000, the Australian Bureau of Statistics (ABS) first 
obtained a register of businesses containing taxation data 
from the Australian Taxation Office (ATO). The data items 
included turnover, sales, and other expense items. In 2001, 
the ABS used this register as a sampling frame for some 
surveys in order to improve the efficiency of its sample 
designs. This data is updated for each business at least 
annually. To make maximum use of these administrative 
data items in estimation the ABS developed a generalized 
estimation system called ABSEST, with the capability of 
supporting generalized regression estimation (GREG) and 
variance estimation. ABSEST has been routinely used for 
the monthly ABS Retail Survey since July 2005. 

A generalized estimation system is highly desirable for 
statistical agencies as it supports a variety of survey output 
requirements at high levels of statistical rigor for an
acceptable cost. The ABS has invested considerable 
resources into its generalized estimation system for business 
surveys. Prior to 1998, the ABS’s generalized estimation 
system was capable of Horvitz-Thompson, ratio, and two- 
phase estimation with variance estimates based on Taylor 
Series (TS) approximations. In 1999, the Taylor Series 
method was replaced with the Jackknife method. 
Subsequent feedback about the computer design and
usability was that changes to the generalized estimation
system made it increasingly complex to maintain and
develop and that processing time could be undesirably long.
These key features were important when choosing the 
variance estimation method for ABSEST. 


Core survey output statistics for ABS business surveys 
are estimates at a point in time, estimates of movement 
between two time points, and estimates of rates. Business 
surveys are equal probability designs within stratum, are 
highly stratified (100s of strata), can be either single or two 
phase sample designs, and for surveys that sample on more 
than one occasion the overlapping sample can range from 0 
to 100%. The sample sizes for business surveys range from
less than 1,000 to 15,000; stratum level sample sizes can be 
as low as 3 and as high as several hundred. 

Section 2 introduces the GREG estimator. Section 3 
discusses alternative variance estimators for GREG and 
justifies why the Bootstrap variance estimator was chosen 
for ABSEST. Section 4 describes the Without Replacement 
Scaled Bootstrap (WOSB) and Rao and Wu (1988)’s With 
Replacement Bootstrap (WSB) variance estimators for 
point-in-time estimates under single-phase designs. Section 
5 describes the WOSB for movement estimates. Section 6 
measures the bias and variance properties of WOSB and 
WSB in a simulation study. Section 7 gives some 
concluding remarks. 


2. Generalised regression (GREG) estimator 


In this section we briefly describe the GREG that is 
implemented in ABSEST. Consider a finite population $U$ 
divided into $H$ strata $U = \{U_1, U_2, \ldots, U_H\}$, where $U_h$ is 
comprised of $N_h$ units. The finite population total of 
interest is $Y = \sum_h Y_h$, where $Y_h = \sum_{i \in U_h} y_{hi}$ and $h = 1, \ldots, H$. 
Within stratum $h$, the sample $s_h$ of $n_h$ units is selected 
from $U_h$ by Simple Random Sampling without Replacement 
(SRSWOR). The complete sample set is denoted by 
$s = \{s_1, s_2, \ldots, s_H\}$. 

Consider the case where a $K$ vector of auxiliary 
variables $x_i = (x_{1i}, \ldots, x_{Ki})'$ is available for $i \in s$ 
and the corresponding vector of population totals 
$X = \sum_{i \in U} x_i$ is known. The GREG estimator (Särndal, 
Swensson and Wretman 1992, page 227) is given by 

$$\hat{Y}_{reg} = \sum_{i \in s} \tilde{w}_i y_i = \sum_{i \in s} w_i y_i + (X - \hat{X})' \hat{B},$$

where $\tilde{w}_i = w_i g_i$, $w_i = N_h/n_h$, $\hat{B} = \hat{T}^{-1} \hat{t}$ with $\hat{T}^{-1}$ being the 
generalised inverse of $\hat{T}$, $\hat{X} = \sum_{i \in s} w_i x_i$, 
$g_i = 1 + (X - \hat{X})' \hat{T}^{-1} x_i / c_i$, $\hat{T} = \sum_{i \in s} w_i x_i x_i' / c_i$, 
$\hat{t} = \sum_{i \in s} w_i x_i y_i / c_i$, and $c_i$ is a constant motivated by the 
superpopulation model $y_i = x_i' \beta + \varepsilon_i$ such that $\varepsilon_i$ is 
independently and identically distributed with mean 0 and 
variance $\sigma^2$, and $E(\hat{B}) \approx \beta$. It is well known that $\hat{Y}_{reg}$ is 
unbiased to $O(n^{-1})$. The weights $\tilde{w}_i$ are stored for ready 
calculation of estimates. In practice bounds will be placed 
on the weights, $\tilde{w}_i$. If the weights, $\tilde{w}_i$, given by the above 
equation, are outside these bounds, they are calculated 
through iteration (see Method 5 of Singh and Mohl 1996). 
The expression for $\hat{Y}_{reg}$ can be adapted to a range of 
estimates, including domains and multi-phase (see Estevao, 
Hidiroglou and Särndal 1995). For example, when $x_i = 1$, 
$\hat{Y}_{reg}$ becomes the Horvitz-Thompson estimator given by 
$\hat{Y} = \sum_h w_h \sum_{i \in s_h} y_{hi}$ with estimated variance 

$$\widehat{\mathrm{Var}}(\hat{Y}) = \sum_h N_h^2 (1 - f_h) s_h^2 / n_h,$$

where $s_h^2 = (n_h - 1)^{-1} \sum_{i \in s_h} (y_{hi} - \bar{y}_h)^2$, 
$\bar{y}_h = \sum_{i \in s_h} y_{hi} / n_h$ and $f_h = n_h / N_h$. 
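As an illustration of the GREG weights described in this section, the following is a minimal sketch, not the ABSEST implementation. The function name `greg_weights` and its argument names are hypothetical, and the weight bounding through iteration (Singh and Mohl 1996) is not attempted. It computes $\tilde{w}_i = w_i g_i$ with $g_i = 1 + (X - \hat{X})' \hat{T}^{-1} x_i / c_i$, using a generalised inverse for $\hat{T}$.

```python
import numpy as np

def greg_weights(x, w, X_totals, c=None):
    """Return GREG weights w_i * g_i (hypothetical helper, unbounded weights).

    x        : (n, K) auxiliary values for the sampled units
    w        : (n,) design weights N_h / n_h
    X_totals : (K,) known population totals of the auxiliary variables
    c        : (n,) model constants c_i (defaults to 1 for every unit)
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    c = np.ones(len(w)) if c is None else np.asarray(c, dtype=float)
    X_hat = (w[:, None] * x).sum(axis=0)        # sum_i w_i x_i
    T_hat = (x * (w / c)[:, None]).T @ x        # sum_i w_i x_i x_i' / c_i
    # T_hat is symmetric, so x_i' T^-1 (X - X_hat) equals (X - X_hat)' T^-1 x_i
    lam = np.linalg.pinv(T_hat) @ (np.asarray(X_totals, dtype=float) - X_hat)
    g = 1.0 + (x @ lam) / c
    return w * g

# Toy data: an intercept and one size variable, n = 4 units from N = 20
x = np.array([[1, 2], [1, 4], [1, 6], [1, 9]], dtype=float)
w = np.full(4, 5.0)
X_totals = np.array([20.0, 110.0])
w_tilde = greg_weights(x, w, X_totals)
```

A useful check on any such implementation is the calibration property: the GREG weights applied to the auxiliary variables reproduce the known totals, so `w_tilde @ x` should equal `X_totals`.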


3. Comparison of alternative variance estimators 


The ABSEST variance estimation method was required 
to have bias and variance properties that were competitive in 
simulation studies, when compared with alternatives in the 
literature. In order to simplify the maintenance and 
development of the system, the variance estimation system 
specifications were required to be generic such that all 
calculations were largely independent of the estimator. 
(ABSEST need only support SRSWOR within stratum and 
single stage sample designs). Also, strong consideration was 
given to minimise the computational costs. 

We first considered the Bootstrap, Jackknife and 
Balanced Repeated Replication (BRR) methods (Shao and 
Tu 1995, Rao and Wu 1988). Consider estimating the 
variance of a function $\hat{\theta} = \theta(\hat{Y})$, where $\hat{Y}$ is a $P$ vector of 
estimates $\hat{Y} = \sum_{i \in s} w_i y_i$, $y_i$ is the response vector from 
unit $i$ with elements $y_{pi}$, and $\theta$ is a smooth function. 
Estimating the variance using a replication method involves 
the following steps: 


(i) independently sub-sampling from the set $s$ a total 
of $R$ times; 

(ii) for each of the $R$ sub-samples computing $w_i^* = 
b_i w_i$, where $b_i$ depends upon the number of times 
unit $i$ is selected in the sub-sample; 

(iii) calculate $\hat{\theta}^* = \theta(\hat{Y}^*)$, where $\hat{Y}^* = \sum_{i \in s} w_i^* y_i$; 

(iv) estimate variance of $\hat{\theta}$ by $\widehat{\mathrm{Var}}_{rep}(\hat{\theta}) = 
(R - 1)^{-1} \sum_{r=1}^{R} (\hat{\theta}^{*(r)} - \hat{\theta})^2$, where $\hat{\theta}^{*(r)}$ is the 
estimate of $\theta$ based on the $r$th replicate sample. Note: 
the expression for replicate weights, $w_i^* = b_i w_i$, 
includes the Jackknife, Bootstrap and Balanced 
Repeated Replication as special cases. 


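As we can express $\hat{Y}_{reg}$ by a function $\theta$, the variance of 
$\hat{Y}_{reg}$ can be calculated by the above steps where specifically 
steps (iii) and (iv) respectively become: (iii) calculate 
$\hat{Y}_{reg}^* = \sum_{i \in s} \tilde{w}_i^* y_i$, where $\tilde{w}_i^* = w_i^* g_i^*$ and $g_i^*$ has the same 
form as $g_i$ but is calculated using the weights $w_i^*$ instead 
of the weights $w_i$ for $i \in s$; (iv) estimate variance by 
$\widehat{\mathrm{Var}}_{rep}(\hat{Y}_{reg}) = (R - 1)^{-1} \sum_{r=1}^{R} (\hat{Y}_{reg}^{*(r)} - \hat{Y}_{reg})^2$. 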
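The four replication steps can be sketched generically as below. This is an illustrative sketch under stated assumptions rather than the ABSEST code: the function `replicate_variance` and its arguments are hypothetical names, the factors $b_i$ are taken as given (they are what distinguishes the Jackknife, Bootstrap and BRR), and the replicate estimates are centred on the full-sample estimate.

```python
import numpy as np

def replicate_variance(theta, y, w, replicate_factors):
    """Replication variance estimate of theta(Y_hat).

    theta             : function mapping a P-vector of estimated totals to a scalar
    y                 : (n, P) response vectors for the sampled units
    w                 : (n,) full-sample weights
    replicate_factors : (R, n) array holding b_i for each of the R replicates
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    b = np.asarray(replicate_factors, dtype=float)
    theta_hat = theta(w @ y)                                     # full-sample estimate
    theta_reps = np.array([theta((b_r * w) @ y) for b_r in b])   # steps (ii)-(iii)
    R = len(theta_reps)
    return ((theta_reps - theta_hat) ** 2).sum() / (R - 1)       # step (iv)

# Toy check: two units, two replicates that each double one unit and drop the other
y = np.array([[1.0], [2.0]])
w = np.array([1.0, 1.0])
b = np.array([[2.0, 0.0], [0.0, 2.0]])
v = replicate_variance(lambda t: t[0], y, w, b)
```

In the toy check the full-sample total is 3 and the replicate totals are 2 and 4, so the estimate is (1 + 1)/(2 - 1) = 2.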

The attractive feature of these replication methods is that 
only the selection of the replicate samples and the value $b_i$ 
are required to calculate unbiased variance estimates for 
many commonly used sample designs and for estimators 
that have good first order Taylor Series approximations. 
Also, if the replicate weights, $\tilde{w}_i^*$, are stored, the variance 
estimates of $\hat{Y}_{reg}$ require simple calculations that can be 
completed in a short time; this approach of storing replicate 
weights has been applied successfully by the ABS’ 
generalized estimation system for household surveys. Once 
replicate weights are available, calculation of variance for a 
variety of analyses, such as linear regression, involves 
simple calculations that require little time and do not 
require the analyst to have knowledge about the sample 
design. Next we consider the relative merits of some 
replicate variance estimators for implementation in 
ABSEST. 

The drop-one Jackknife forms replicate samples, $s^*$, by 
dropping one unit at a time. This implies that $R = n$. For 
large-scale surveys this storage requirement is excessive. 
The delete-a-group Jackknife, while reducing $R$ by 
dropping a group of units within a stratum at a time, would 
still have at least $R = 2H$ replicates, since a minimum of two 
groups per stratum is required to calculate variance. Despite 
performing well in an empirical study where $n_h = 2$ (see 
Shao and Tu 1995, page 251), the Jackknife was rejected on 
the basis of its excessive storage requirement. 

For stratified designs the scaled Balanced Repeated 
Replication (BRR) requires approximately $R = H$ replicate 
weights. Firstly, the replicate samples are formed by 
randomly splitting the stratum sample $s_h$ into two groups 
then allocating one of these groups to $s^*$ for each 
$h = 1, \ldots, H$. The allocation of groups to replicates, 
defined by a Hadamard matrix, is done in such a way as to 
eliminate between stratum covariance in the replicate 
samples. The Grouped BRR (GBRR) (see Shao and Tu 
1995) can arbitrarily reduce $R$ at the cost of introducing 
between stratum covariance in the replicate samples. 
Preston and Chipperfield (2002) showed in an empirical 
evaluation for a typical ABS business survey that BRR (and 
GBRR) was significantly more unstable than the Bootstrap. 

In their summary of the literature, Kovar, Rao and Wu 
(1988) found that the scaled Bootstrap tended to have a 
larger bias compared with the Jackknife or Taylor Series (TS) 
method when estimating the variance of GREG estimates. As the relative 
assessment of these methods varied according to the 
underlying simulation model and the stratum sample size it 
was important to make an assessment that was based on a 
model and sample design that were typical of ABS business 
surveys. Section 6 shows these properties to be acceptable. 
Unlike the other replication methods, the value of R for the 
Bootstrap may be chosen arbitrarily and so meet storage and 
computation restrictions. Further, the selection of the 
Bootstrap replicate samples is more easily specified in a 
computer system compared with selection of the BRR 
replicate samples. 

We considered the relative merits of a number of other 
variance estimators for implementation in ABSEST. The TS 
method was not suitable as its variance expression for 
complex estimands involves many terms specific to the 
estimand, making it difficult to adapt into a generalized 
system. This problem was addressed by Nordberg (2000) 
who described a method for variance calculation that 
automatically generates the Taylor Series expansions. They 
implemented the method in a computer system, called 
CLAN. CLAN can handle any function of means under 
Probability Proportional to Size (PPS) and cluster sampling. 
A limitation of CLAN is that it does not produce replicate 
weights which support complex analysis, such as regression 
analysis, either within or outside a statistical agency; it 
requires knowledge of the sample design; and it would be a 
relatively complex system to specify and maintain in a 
generalized system (in comparison to the Bootstrap). For the 
same reason other linearized variance estimators (described 
in Estevao, Hidiroglou and Sarndal 1995 and evaluated in 
Yung and Rao 1996) were rejected, despite good theoretical 
properties, good empirical results and being computationally 
efficient. 

On the above considerations, the preferred variance 
estimation method for ABSEST was the Bootstrap. In the 
next section we describe the WOSB and WSB, where only 
the former is implemented in ABSEST. 


4. Without replacement scaled bootstrap (WOSB) 
for point-in-time estimates 


4.1 Method 


For point-in-time GREG estimates, the Without Replacement 
Scaled Bootstrap (WOSB) variance estimator involves 
repeating the following $R$ times: 


(a) forming the set $s^*$ by selecting $m_h$ units by 
SRSWOR from $s_h$ independently within each stratum 
$h = 1, \ldots, H$, where $m_h = [n_h/2]$ and the 
operator $[\cdot]$ rounds its argument down to the 
nearest integer; 


(b) calculating $w_{hi}^* = w_{hi}(1 - \gamma_h + \gamma_h (n_h/m_h) \delta_{hi})$ for 
$i \in s$, where $\gamma_h = \sqrt{(1 - f_h) m_h / (n_h - m_h)}$ and $\delta_{hi}$ 
is 1 if $i \in s^*$ and 0 otherwise; 


(c) calculating $\tilde{w}_{hi}^* = w_{hi}^* g_{hi}^*$ for $i \in s$; and 


(d) calculating the $r$th Bootstrap estimate of $Y$, 
$\hat{Y}_{reg}^{*(r)} = \sum_{i \in s} \tilde{w}_{hi}^* y_{hi}$. 


The justification for $m_h = [n_h/2]$ is given in 
section 4.2. The Bootstrap variance estimator is 
given by the Monte Carlo approximation, 

$$\widehat{\mathrm{Var}}_B(\hat{Y}_{reg}) = (R - 1)^{-1} \sum_{r=1}^{R} (\hat{Y}_{reg}^{*(r)} - \hat{Y}_{reg})^2.$$

The WSB method is the same as WOSB except that the 
replicate samples are selected by SRSWR and the 
scaling factor is instead $\gamma_h = \sqrt{(1 - f_h) m_h / (n_h - 1)}$, 
where $m_h$ is often set to $n_h - 1$ in the literature. 
Preston and Chipperfield (2002) found that WOSB 
had significantly less replication error than WSB, where 
replication error is the error due to replicate sampling, 
conditional on the sample set. 
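Steps (a) and (b) of the WOSB can be sketched as follows. This is an illustrative sketch and not the ABSEST code: the function name `wosb_replicate_weights`, the integer stratum labels and the dictionary of stratum population sizes are assumptions made for the example, and step (c), recomputing the GREG factor from each set of replicate weights, is left out. Every stratum is assumed to contain at least two sampled units so that $m_h \ge 1$.

```python
import numpy as np

def wosb_replicate_weights(strata, N_h, R, rng):
    """Design weights and WOSB replicate weights for a stratified SRSWOR sample.

    strata : (n,) integer stratum label of each sampled unit
    N_h    : dict mapping stratum label -> stratum population size
    R      : number of bootstrap replicates
    rng    : numpy random Generator
    Returns (w, w_star), where w_star has shape (R, n).
    """
    strata = np.asarray(strata)
    n = len(strata)
    w = np.empty(n)
    w_star = np.empty((R, n))
    for h in np.unique(strata):
        idx = np.flatnonzero(strata == h)
        n_h = len(idx)
        m_h = n_h // 2                                    # m_h = [n_h / 2]
        f_h = n_h / N_h[int(h)]
        w[idx] = N_h[int(h)] / n_h
        gamma = np.sqrt((1 - f_h) * m_h / (n_h - m_h))    # scaling factor
        for r in range(R):
            delta = np.zeros(n_h)
            delta[rng.choice(n_h, size=m_h, replace=False)] = 1.0   # SRSWOR of m_h
            w_star[r, idx] = w[idx] * (1 - gamma + gamma * (n_h / m_h) * delta)
    return w, w_star

rng = np.random.default_rng(0)
strata = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2])
w, w_star = wosb_replicate_weights(strata, {1: 50, 2: 20}, R=10, rng=rng)
```

Two properties follow directly from the formula and make convenient checks: within each replicate the weights still sum to the population size (here 70), and with $m_h = [n_h/2]$ the scaling factor never exceeds 1, so no replicate weight is negative.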


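It is easy to see that the WOSB and WSB estimators are 
unbiased estimators of $\widehat{\mathrm{Var}}(\hat{\theta})$. The TS approximate 
variance is given by $\widehat{\mathrm{Var}}(\hat{\theta}) = \nabla'\theta \, \hat{V}(\hat{Y}) \, \nabla\theta$, where $\hat{V}(\hat{Y})$ 
is a $P \times P$ matrix with elements 

$$\widehat{\mathrm{Cov}}(\hat{Y}_p, \hat{Y}_{p'}) = \sum_h \frac{N_h^2 (1 - f_h)}{n_h} s_{hpp'},$$

where 

$$s_{hpp'} = \frac{1}{n_h - 1} \sum_{i \in s_h} (y_{hpi} - \bar{y}_{hp})(y_{hp'i} - \bar{y}_{hp'}),$$

for $p, p' = 1, \ldots, P$ and $\nabla\theta = (\partial\theta/\partial Y_1, \ldots, \partial\theta/\partial Y_P)'$. It is 
easy to see that 

$$E_*(\widehat{\mathrm{Var}}(\hat{\theta}^*)) = \nabla'\theta \, E_*(\hat{V}(\hat{Y}^*)) \, \nabla\theta = \nabla'\theta \, \hat{V}(\hat{Y}) \, \nabla\theta$$

by noting that 

$$E_*(\widehat{\mathrm{Cov}}(\hat{Y}_p^*, \hat{Y}_{p'}^*)) = \widehat{\mathrm{Cov}}(\hat{Y}_p, \hat{Y}_{p'}),$$

where $E_*$ denotes the expectation with respect to 
re-sampling. Note the scaling constants applied to $w_{hi}$ to 
calculate the replicate weights are chosen so that the correct 
finite population correction factor is obtained. It therefore 
follows that the Monte Carlo approximation to the variance, 
$\widehat{\mathrm{Var}}_B(\hat{\theta}) = (R - 1)^{-1} \sum_{r=1}^{R} (\hat{\theta}^{*(r)} - \hat{\theta})^2$, is unbiased for $\widehat{\mathrm{Var}}(\hat{\theta})$. 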
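The role of the scaling constant can be seen in a short worked derivation for the simplest case, the Horvitz-Thompson total in a single stratum. The stratum subscript is dropped, $w = N/n$, $\gamma$ is the scaling factor, $\delta_i$ indicates selection into the replicate sample $s^*$ of $m$ units, and $\bar{y}^*$ denotes the mean of $y$ over $s^*$ (a symbol introduced here only for this sketch).

```latex
% Replicate estimate written as a mix of the full-sample estimate and the
% replicate-sample mean
\hat{Y}^* = \sum_{i \in s} w \left(1 - \gamma + \gamma \frac{n}{m} \delta_i\right) y_i
          = (1 - \gamma)\,\hat{Y} + \gamma N \bar{y}^*

% WOSB: s^* is an SRSWOR of m units from the n sampled units
\mathrm{Var}_*(\hat{Y}^*) = \gamma^2 N^2 \left(1 - \frac{m}{n}\right) \frac{s^2}{m}
  = \frac{(1 - f)\,m}{n - m} \cdot N^2 \,\frac{n - m}{n m}\, s^2
  = N^2 (1 - f) \frac{s^2}{n}

% WSB: s^* is an SRSWR of m units, so Var_*(\bar{y}^*) = (n - 1) s^2 / (n m)
\mathrm{Var}_*(\hat{Y}^*) = \frac{(1 - f)\,m}{n - 1} \cdot N^2 \,\frac{n - 1}{n m}\, s^2
  = N^2 (1 - f) \frac{s^2}{n}
```

In both cases the resampling variance reproduces the usual stratified variance estimator, including the finite population correction $1 - f$, which is the property the scaling constants are chosen to deliver.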


4.2 A note on the relative efficiency of WSB and 
WOSB sampling 


To simplify notation, let $\hat{v}_{boot} = \widehat{\mathrm{Var}}_B(\hat{\theta})$. The variance 
of the Bootstrap variance estimator can be written as 

$$\mathrm{Var}(\hat{v}_{boot}) = \mathrm{Var}_s(E_*[\hat{v}_{boot} \mid s]) + E_s[\mathrm{Var}_*(\hat{v}_{boot} \mid s)],$$

where the subscript $s$ denotes expectation or variance with 
respect to the sample design. If $\hat{v}_{boot}$ is unbiased 
(i.e., $E_*[\hat{v}_{boot} \mid s] = \widehat{\mathrm{Var}}(\hat{\theta})$) 
then $\mathrm{Var}_s(E_*[\hat{v}_{boot} \mid s])$ does not depend upon how the replicate 
samples are selected. The term $\mathrm{Var}_*(\hat{v}_{boot} \mid s)$ is the 
replication error conditional on the sample and is inversely 
proportional to $R$. The value of $R$ is chosen to be 
sufficiently large such that $\mathrm{Var}_*(\hat{v}_{boot} \mid s)$ is small 
relative to $\hat{v}_{boot}$, the estimated sample variance. The 
efficiency of two Bootstrap estimators can be compared by 
the size of $\mathrm{Var}_*(\hat{v}_{boot} \mid s)$ when both estimators have 
the same value of $R$. Next we summarise empirical results 
based on actual data that show the WOSB can be 
significantly more efficient than WSB. The benefits of 
efficiency are reduced computation time, more 
accurate variance estimates, or both. 

Preston and Chipperfield (2002) compared the efficiency 
of WOSB with $m_h = [n_h/2]$ and WSB with $m_h = n_h - 1$ 
(see Rao and Wu 1984) for the Australian Quarterly 
Economic Activity Survey in March 2000. This survey has 
a stratum level sample size that varies from 4 into the 
hundreds. The results (derived from Preston and Chipperfield 
2002, Table 1) show that at the national level the replication 
error, $\mathrm{Var}_*(\hat{v}_{boot} \mid s)$, was 54% smaller for WOSB compared 
with WSB sampling when $R = 100$ (see Preston and 
Chipperfield 2002 for more empirical estimates of the 
replication error for WSB and WOSB). In other words, 
WOSB required about half the number of replicates to 
achieve the same replication error as WSB. This represents 
a significant efficiency gain. Another benefit of WOSB over 
WSB is that the computational time in selecting the replicate 
samples is considerably less. 

From empirical investigations, the choice of $m_h = [n_h/2]$ 
for WOSB minimized the replication error, $\mathrm{Var}_*(\hat{v}_{boot} \mid s)$. As $n$ 
increases, we suspect that the difference between WOSB 
and WSB will reduce to approximately zero. More work 
needs to be done to establish these properties. 


Statistics Canada, Catalogue No. 12-001-XPB 


5. Movement variance between 
single phase estimates 


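A key output requirement of many business surveys is 
the estimate of change between two time points. Denote the 
finite population at time $t$ by $U^{(t)} = \{U_1^{(t)}, U_2^{(t)}, \ldots, U_H^{(t)}\}$, 
where $U_h^{(t)}$ is the stratum $h$ population at time $t$ that is 
made up of $N_h^{(t)}$ units. The population total at time $t$ is 
$Y^{(t)} = \sum_h \sum_{i \in U_h^{(t)}} y_{hi}^{(t)}$. Estimating the variance of 
$\hat{\Delta}^{(t)} = \hat{Y}^{(t)} - \hat{Y}^{(t-1)}$, the difference between two time periods, is the 
focus of this section. The terms corresponding to $n_h$, $f_h$ 
and $s_h$ at time $t$ are denoted by $n_h^{(t)}$, $f_h^{(t)}$ and $s_h^{(t)}$ 
respectively. When sampling on two occasions let $N_{ch}$ and 
$n_{ch}$ be the number of units in the sets $U_h^{(1)} \cap U_h^{(2)}$ and 
$s_{ch} = s_h^{(1)} \cap s_h^{(2)}$ respectively, so that the non-overlapping 
samples $s_h^{(1)} - s_{ch}$ and $s_h^{(2)} - s_{ch}$ contain $n_h^{(1)} - n_{ch}$ and 
$n_h^{(2)} - n_{ch}$ units. In ABS business surveys the 
time 1 sample of size $n_h^{(1)}$ is an SRSWOR from $U_h^{(1)}$. The 
time 2 sample is the union of the following two samples: an 
SRSWOR of $n_{ch}$ units from $s_h^{(1)}$ and an SRSWOR of $n_h^{(2)} - n_{ch}$ 
units from $U_h^{(2)} - (U_h^{(2)} \cap s_h^{(1)})$. The time 2 sample is 
effectively an SRSWOR from $U_h^{(2)}$. At the ABS, the size of 
the overlapping sample, $n_{ch}$, is controlled by the Permanent 
Random Number method (see Brewer, Gross and Lee 
1999). 

The variance of $\hat{\Delta} = \hat{Y}^{(2)} - \hat{Y}^{(1)}$ can be expressed as 

$$\mathrm{Var}(\hat{\Delta}) = \mathrm{Var}(\hat{Y}^{(1)}) + \mathrm{Var}(\hat{Y}^{(2)}) - 2\,\mathrm{Cov}(\hat{Y}^{(1)}, \hat{Y}^{(2)}).$$

Consider the Horvitz-Thompson estimator $\hat{\Delta} = \hat{Y}^{(2)} - \hat{Y}^{(1)}$, 
where $\hat{Y}^{(t)}$, $t = 1, 2$, is defined analogously to $\hat{Y}$. Tam 
(1985) shows that when $U_h^{(1)} = U_h^{(2)}$, an unbiased estimator 
of $\mathrm{Var}(\hat{\Delta})$ under the above sampling scheme is 

$$\widehat{\mathrm{Var}}(\hat{\Delta}) = \widehat{\mathrm{Var}}(\hat{Y}^{(1)}) + \widehat{\mathrm{Var}}(\hat{Y}^{(2)}) - 2\,\widehat{\mathrm{Cov}}(\hat{Y}^{(1)}, \hat{Y}^{(2)}),$$

where 

$$\widehat{\mathrm{Var}}(\hat{Y}^{(t)}) = \sum_h N_h^2 (1 - f_h^{(t)}) s_h^{(t)2} / n_h^{(t)},$$

$$s_{ch}^{(12)} = \frac{1}{n_{ch} - 1} \sum_{i \in s_{ch}} (y_{hi}^{(1)} - \bar{y}_{ch}^{(1)})(y_{hi}^{(2)} - \bar{y}_{ch}^{(2)}),$$

$$\widehat{\mathrm{Cov}}(\hat{Y}^{(1)}, \hat{Y}^{(2)}) = \sum_h N_h^2 (1 - f_{ch}) \frac{n_{ch}}{n_h^{(1)} n_h^{(2)}} s_{ch}^{(12)},$$

for $t = 1, 2$ and $f_{ch} = n_h^{(1)} n_h^{(2)} / (n_{ch} N_h)$, with $\bar{y}_{ch}^{(t)}$ the 
mean of $y_{hi}^{(t)}$ over $s_{ch}$. 
When $U_h^{(1)} \ne U_h^{(2)}$ a more general form of Tam’s 
estimator is given by $\widehat{\mathrm{Var}}(\hat{\Delta})$, except that 

$$\widehat{\mathrm{Var}}(\hat{Y}^{(t)}) = \sum_h N_h^{(t)2} (1 - f_h^{(t)}) s_h^{(t)2} / n_h^{(t)},$$

$$\widehat{\mathrm{Cov}}(\hat{Y}^{(1)}, \hat{Y}^{(2)}) = \sum_h N_h^{(1)} N_h^{(2)} (1 - f_{ch}) \frac{n_{ch}}{n_h^{(1)} n_h^{(2)}} s_{ch}^{(12)},$$

and 

$$f_{ch} = \frac{n_h^{(1)} n_h^{(2)} N_{ch}}{n_{ch} N_h^{(1)} N_h^{(2)}}.$$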


For the remainder of this section we assume that $\widehat{\mathrm{Var}}(\hat{\Delta})$ is 
unbiased for $\mathrm{Var}(\hat{\Delta})$ when $U_h^{(1)} \ne U_h^{(2)}$. (It is worthwhile 
noting that $\widehat{\mathrm{Var}}(\hat{\Delta})$ can take negative values when 
$U_h^{(1)} \ne U_h^{(2)}$. Nordberg (2000) gives an unbiased estimator 
of $\mathrm{Var}(\hat{\Delta})$ for the regression estimator when $U_h^{(1)} \ne U_h^{(2)}$, 
but there is no obvious way in which it can be used with the 
Bootstrap as described in this paper.) 
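For the equal-population case, the movement variance estimator for the Horvitz-Thompson estimator can be sketched for a single stratum as below. This is an illustrative sketch under stated assumptions: the function name `movement_variance` and its arguments are hypothetical, the covariance term is written in the form $N^2 (n_c/(n^{(1)} n^{(2)}) - 1/N) s^{(12)}$, which is the standard result for overlapping SRSWOR samples, and at least two overlapping units are assumed.

```python
import numpy as np

def movement_variance(y1, y2, y1_common, y2_common, N):
    """Variance estimate for N * (mean(y2) - mean(y1)) in one stratum.

    y1, y2               : values for the full time-1 and time-2 samples
    y1_common, y2_common : time-1 and time-2 values of the overlapping units,
                           listed in the same unit order
    N                    : stratum population size (same at both times)
    """
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    c1, c2 = np.asarray(y1_common, dtype=float), np.asarray(y2_common, dtype=float)
    n1, n2, nc = len(y1), len(y2), len(c1)
    var1 = N**2 * (1 - n1 / N) * y1.var(ddof=1) / n1
    var2 = N**2 * (1 - n2 / N) * y2.var(ddof=1) / n2
    s12 = ((c1 - c1.mean()) * (c2 - c2.mean())).sum() / (nc - 1)
    f_c = n1 * n2 / (nc * N)                  # overlap analogue of the sampling fraction
    cov = N**2 * (1 - f_c) * nc * s12 / (n1 * n2)
    return var1 + var2 - 2 * cov

# Full overlap with identical values: the movement is zero and so is its variance
y = [1.0, 2.0, 4.0, 7.0]
v_same = movement_variance(y, y, y, y, N=20)

# Full overlap with y2 = 2 * y1: the movement equals the time-1 total
ya = [1.0, 2.0, 3.0, 4.0]
yb = [2.0, 4.0, 6.0, 8.0]
v_double = movement_variance(ya, yb, ya, yb, N=20)
```

The second case is a handy sanity check: because the movement equals the time-1 estimate, the result should match the variance estimate of that total, $N^2 (1 - f) s^2 / n$.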

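Estimating the variance of $\hat{\Delta}_{reg} = \hat{Y}_{reg}^{(2)} - \hat{Y}_{reg}^{(1)}$, the 
movement between GREG estimates at times 1 and 2, using 
WOSB involves repeating the following $R$ times: 


(a) forming the set $s^*$ by independently selecting 
$m_{ch} = [n_{ch}/2]$, $[(n_h^{(1)} - n_{ch})/2]$ and $[(n_h^{(2)} - n_{ch})/2]$ 
units by SRSWOR from the overlapping sample $s_{ch}$ and the 
non-overlapping samples $s_h^{(1)} - s_{ch}$ and $s_h^{(2)} - s_{ch}$ 
respectively; 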


(b) for i € s\” calculate the replicate weights 


1 
On ne | | Neh ae Ach oe 
a Nj, —~ Yeh — a) 0 — Yich (a) n ad see 


fori € s,., 


wi = Mish Ms sr(l) 
= |1- YY Vien = OH 


hy, Mysp 


for i < s\), where 


tig AL alia Meg Le Sir, n)] Mcp 


{icy (Men — Mem) $ 


y 


(1 i Sion) Mp (ney 7% M,,,) 


and $\delta_{hi}^{(t)}$ equals 1 if unit $i$ is selected in the 
replicate group at time point $t$ and zero otherwise; 


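(c) calculate weights defined analogously for $i \in s_h^{(2)}$; 


(d) calculate $\tilde{w}_{hi}^{*(t)} = w_{hi}^{*(t)} g_{hi}^{*(t)}$ for $i \in s^{(1)}, s^{(2)}$, where 
$g_{hi}^{*(t)}$ has the same form as $g_i$ but is calculated 
with the weights $w_{hi}^{*(t)}$ instead of $w_{hi}^{(t)}$; 

(e) calculate $\hat{\Delta}_{reg}^* = \sum_{i \in s^{(2)}} \tilde{w}_{hi}^{*(2)} y_{hi}^{(2)} - 
\sum_{i \in s^{(1)}} \tilde{w}_{hi}^{*(1)} y_{hi}^{(1)}$. 

The WOSB variance estimator is given by 

$$\widehat{\mathrm{Var}}_B(\hat{\Delta}_{reg}) = (R - 1)^{-1} \sum_{r=1}^{R} (\hat{\Delta}_{reg}^{*(r)} - \hat{\Delta}_{reg})^2.$$

The proof that $\widehat{\mathrm{Var}}_B(\hat{\Delta}_{reg})$ is unbiased is straightforward 
and is similar to the proof that $\widehat{\mathrm{Var}}_B(\hat{\theta})$ is unbiased 
(see section 4). 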

The approach described above requires a separate set of 
replicate weights for movement and level variance 
estimates. Roberts, Kovačević, Mantel and Phillips (2001) 
consider approximate Bootstrap variance estimators of 
movement that only use the level replicate weights, hence 
reducing computational costs and simplifying the method and 
its implementation in a computing system. 


6. Simulation study 


This section summarizes a simulation study for point-in- 
time and movement estimates carried out to empirically 
measure the bias and variability of WOSB and WSB over 
repeated sampling when R=100. A population was 
generated at time points 1 and 2 from the following models, 

y = (0.75x, + 0.25x,,) WO, 2.5, 1) ey a 
‘ie va W(0, 5, 1), where the auxiliary variables are given 
DY ak O23 SO TOALUUL(U LL LT. sand, ye = 
100L (0, 1, 1) where $W(\mu, \gamma, \alpha)$ and $L(\mu, \gamma, \alpha)$ are the 
Weibull and Log-normal distributions with location, shape 
and scale parameters given by $\mu$, $\gamma$ and $\alpha$. These 
distributions reflect the long tails that are typical of 
economic survey data. The times 1 and 2 populations were 
of size 3,000, with 2,500 population units common to both 
time points. Each population unit, i, was assigned to one of 
5 strata at both time points using $z_i$, where 
$z_i = x_{1i} W(0, 2.5, 1)$ and the stratum boundaries were 
$z_i = 50, 100, 150, 250$. This resulted in stratum popula- 
tion sizes that ranged from 400 to 1,000. 

A total of 3,000 simulated stratified SRSWOR samples were 
taken from the population at times 1 and 2, where n\? = 
Liat ee ane , eae ota Mei ana Led 
For WOSB the replicate sample sizes are given in sections 4 
and 5. For WSB the replicate sample sizes for movements 
were one less than the size of the set being resampled (the 
overlapping sample and each of the two non-overlapping samples) 
and for levels were $m_h^{(t)} = n_h^{(t)} - 1$. The WSB estimator for 
movements has the same form as WOSB but has slightly 
different scaling factors and takes replicate samples with 
replacement. 

From each of the 3,000 simulated samples Len is 
calculated, where Y2, is given by SCTE TS Fee oe 


reg 
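$c_i = 1$ and $j = 1, 2, \ldots, 3{,}000$. The true standard error of 
$\hat{Y}_{reg}$ is calculated by 

$$S = \sqrt{\frac{1}{2{,}999} \sum_{j=1}^{3{,}000} (\hat{Y}_{reg}^{(j)} - \bar{Y}_{reg})^2},$$

where $\bar{Y}_{reg}$ is the mean of $\hat{Y}_{reg}^{(j)}$ over the simulated samples. 
The Bootstrap’s estimated standard error of $\hat{Y}_{reg}$ from the 
$j$th sample is 

$$S^{(j)} = \sqrt{\frac{1}{99} \sum_{r=1}^{100} (\hat{Y}_{reg}^{*(r,j)} - \hat{Y}_{reg}^{(j)})^2},$$

where $\hat{Y}_{reg}^{*(r,j)}$ is defined analogously to $\hat{Y}_{reg}^{*(r)}$. The Relative 
Bias (RB) of the Bootstrap’s standard error is 

$$RB(S) = \frac{1}{3{,}000\,S} \sum_{j=1}^{3{,}000} (S^{(j)} - S).$$

The Relative Root Mean Squared Error (RRMSE) of the 
Bootstrap’s estimated standard error is 

$$RRMSE(S) = \frac{1}{S} \sqrt{\frac{1}{3{,}000} \sum_{j=1}^{3{,}000} (S^{(j)} - S)^2}.$$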

Similar definitions for RRMSE and bias are used when 
estimating the movement variance. The 95% coverage 
probabilities, the percentage of 95% confidence intervals 
containing the true population total, of WOSB and WSB for 
levels and movement are also compared. 
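The two summary measures can be computed from the simulated standard errors as in the sketch below. The function name `relative_bias_and_rrmse` is hypothetical, and the results are returned as proportions, so they would need to be multiplied by 100 to be read as percentages.

```python
import numpy as np

def relative_bias_and_rrmse(se_estimates, true_se):
    """Relative bias and relative root mean squared error of estimated
    standard errors, both expressed as proportions of the true value."""
    se = np.asarray(se_estimates, dtype=float)
    rb = np.mean(se - true_se) / true_se
    rrmse = np.sqrt(np.mean((se - true_se) ** 2)) / true_se
    return rb, rrmse

# Two simulated standard errors scattered symmetrically around a true value of 10
rb, rrmse = relative_bias_and_rrmse([9.0, 11.0], true_se=10.0)
```

In this toy case the errors cancel, so the relative bias is 0 while the RRMSE is 0.1, which illustrates why both measures are reported.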

The results in Table 1 show that the RB and the RRMSE 
of the WOSB and WSB are both acceptably small. The bias 
of WSB’s time point 1 estimates is slightly higher than that of 
WOSB, resulting in slightly worse coverage probabilities. 


Table 1 
Bootstrap estimate of the standard error for 
movements and point-in-time estimates 


Time point 1 Movement 
Method RB RRMSE C95% RB RRMSE C95(%) 
WOSB 40:7 917.3 OAT Tae) gaa O5%3 
WSBi cd Lap $1528 LEN Pe BOTS ee 94.6 


7. Summary 


From the simulation results, both the WOSB and WSB 
were considered to be reliably accurate over repeated sam- 
pling. Conditional on the sample, the WOSB was found to 
be significantly more efficient (up to 50%) than WSB for 
stratified sampling when the stratum sample size is 
sometimes small. As a result, the WOSB was implemented 
in ABSEST. 


Bayesian estimation in small areas when the sampling design 
strata differ from the study domains 


Jacob J. Oleson, Chong Z. He, Dongchu Sun and Steven L. Sheriff 


Abstract 


The purpose of this work is to obtain reliable estimates in study domains when there are potentially very small sample sizes 
and the sampling design stratum differs from the study domain. The population sizes are unknown as well for both the study 
domain and the sampling design stratum. In calculating parameter estimates in the study domains, a random sample size is 
often necessary. We propose a new family of generalized linear mixed models with correlated random effects when there is 
more than one unknown parameter. The proposed model will estimate both the population size and the parameter of interest. 
General formulae for full conditional distributions required for Markov chain Monte Carlo (MCMC) simulations are given 
for this framework. Equations for Bayesian estimation and prediction at the study domains are also given. We apply the 
method to the 1998 Missouri Turkey Hunting Survey, which stratified samples based on the hunter’s place of residence, whereas we require 
estimates at the domain level, defined as the county in which the turkey hunter actually hunted. 


Key Words: Hierarchical Bayes; Markov chain Monte Carlo; Bayesian prediction; Random sample size; Spatial 
correlation; Stratification. 


1. Introduction 


Small sample sizes occur often when analyzing sample 
survey data. These small sample sizes arise frequently when 
studying subpopulations such as socio-demographic groups. 
We may also consider spatial regions and time periods as 
subpopulations or study domains. Due to small sample 
sizes, direct survey estimators could be highly unreliable. 
Estimation stemming from areas with small sample sizes 
has been termed small area estimation (SAE). Rao (2003) 
gives a nice review of many SAE techniques. Some recent 
small area review papers are found in Rao (2005) and Jiang 
and Lahiri (2006). 

Appropriate models are needed in order to produce 
reliable small area statistics. Different model-based methods 
include the empirical best prediction method (see Prasad 
and Rao 1990, Jiang, Lahiri and Wan 2002, Das, Jiang and 
Rao 2004, Jiang and Lahiri 2006) and Bayesian methods 
(see Malec, Sedransk, Moriarity and LeClere 1997, Ghosh, 
Natarajan, Stroud and Carlin, 1998, He and Sun 2000). For 
a good review on Bayesian small area estimation, the 
readers are referred to Rao (2003). This paper concerns a 
practical implementation of Bayesian methodology. One 
critical step in implementing a Bayesian model is selecting 
prior distributions. The propriety of the posterior distri- 
bution and the robustness of the priors should be carefully 
examined. When an MCMC simulation such as the Gibbs 
sampler is used in the computation, the convergence of the 
Gibbs chain must be monitored. For more details, see Carlin 
and Louis (2000). 


We consider a stratified cluster sampling design where 
within each stratum clusters are selected by simple random 
sampling without replacement. In our application, clusters 
are of unequal sizes and the cluster sizes are unknown at the 
time of designing the survey. We consider a domain esti- 
mation problem where domains cut across the design 
clusters. As a result, domain population size is unknown and 
domain sample size is random. In our application, realized 
sample sizes for the domains are small and as such standard 
design-based domain estimation techniques (see Cochran 
1977, Lohr 1999) are unreliable. We propose a fully 
Bayesian hierarchical model to get around the problem. 

We begin by obtaining estimates of success rates and 
population sizes simultaneously at the small area level for 
individuals from each of the design strata. The estimates are 
obtained by borrowing information from the neighboring 
small areas. This is done through a spatial structure built 
into the Bayesian model. Therefore, the resulting estimates 
are much more stable than the direct survey methods. We 
then compute a weighted average of the success rates from 
the design stratum for the final small area estimate. For 
example, if a county is the small area, we compute a county- 
specific success rate estimate for each of the design strata 
and average them for a single county estimate. To combine 
the sampling design strata, the individual stratum population 
sizes should be known. In the case where these population 
sizes are not known, they can be estimated using our 
proposed model. This work is motivated by applying 
Bayesian methods in estimating the turkey hunting success 
rates at the county level in Missouri. We propose a new 
family of generalized linear mixed models with correlated 
random effects when there is more than one unknown 
parameter. We will call this a bivariate generalized linear 
mixed model (bivariate GLMM). A generalized linear 
mixed model (GLMM) with possible correlated random 
effects is often used when there is only one unknown 
parameter (Sun, Speckman and Tsutakawa 2000). The 
proposed model estimates both the population size and the 
parameter of interest and, hence, advances the current state 
of small area research. 
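The combination step described above, averaging the stratum-specific success rates for a county into a single county estimate, can be illustrated with a small sketch. The weights are assumed here to be the stratum-specific population sizes (hunting trips) in the county, which is one natural reading of the text; the function name `combine_strata` and the numbers are hypothetical.

```python
def combine_strata(rates, sizes):
    """Weighted average of stratum-specific success rates for one county.

    rates : success rate estimate for each design stratum
    sizes : known or estimated population size for each design stratum
            in that county (assumed to be the weights)
    """
    total = sum(sizes)
    return sum(p * n for p, n in zip(rates, sizes)) / total

# Hypothetical county: two design strata with 100 and 300 trips
county_rate = combine_strata([0.2, 0.4], [100, 300])
```

With these illustrative inputs the county rate is (20 + 120)/400 = 0.35, closer to the rate of the stratum that contributes more trips.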

The above-described scenario is present in the 1998 
Missouri Turkey Hunting Survey (MTHS). This was a 
spring postseason mail survey that provided the Missouri 
Department of Conservation with information concerning 
the number of turkeys harvested by hunters on each day of 
the hunting season and in which county the harvest 
occurred. Also, the total number of trips made to the 
counties by these hunters on each hunting day was recorded. 
Hunting success rates were then calculated from this 
information. 

The MTHS example is presented in detail in Section 2. It 
is followed in Section 3 by a summary of the proposed 
methodology and by general formulae. Although we esti- 
mate success rates, the methodology is generalizable. We 
also give general formulae to find the estimates and pre- 
dictions for the small areas as well as full conditional 
distributions for use with MCMC simulations. Final 
comments are given in Section 4. 


2. 1998 Missouri turkey hunting survey 


2.1 Background to 1998 MTHS 


The Missouri Department of Conservation began 
biennially in 1986 to track hunter tendencies with the 
MTHS. This survey asked the hunter what county he/she 
hunted in, on what day that occurred, and if the hunt was 
successful or not. It began as a simple random sample of all 
spring turkey hunting permit holders. He and Sun (1998) 
used the information from the 1996 survey to estimate 
turkey hunter success rates in all 114 counties of Missouri 
with a Bayesian Beta-Binomial model. He and Sun (2000) 
estimated county-specific hunting success rates per week of 
the hunting season. Only one harvested turkey was allowed 
per week in the 1996 spring hunting season. They used a 
GLMM and estimated only success rates. Oleson and He 
(2004) extended this model in order to estimate hunting 
success rates for each day of the hunting season. They found 
significant auto-correlation among the days of the hunting 
season and among the counties of Missouri when estimating 
success rates. 

In 1998, the MTHS sampling scheme was changed. The 
frame is still the list of all turkey hunters registered to hunt 
in Missouri and contains, among other things, information 
on each hunter’s county of residence. Simple random 
samples used previously put too much weight on the heavy 
population masses of Kansas City and St. Louis. Hence, 
counties near these metropolises received large samples and 
counties further away (e.g., the southern tiers) received 
insufficient samples. Ideally we would like to stratify by the 
county where hunters pursued turkeys to draw samples 
representative at this domain level. Information about where 
the hunters pursued turkeys is unavailable until after 
questionnaires are returned, meaning this type of strat- 
ification is not possible. 

An alternative is to stratify by where the hunter lives 
since hunters tend to hunt near where they live (in locations 
with which they are familiar). This causes a problem in 
estimating the parameters of interest which are hunting 
success rates per county. We would like estimates based on 
hunting location, but the sampling design uses the hunter’s 
place of residence. The new sampling design of MTHS is a 
stratified simple random sampling of clusters with unequal 
sizes. In this case, a cluster represents a registered hunter 
and its elements are the hunting trips for that hunter. The 
hunter’s place of residence is used as a stratification factor. 
The design strata are: 1) Non-residents of Missouri, 2) 
Residents of Northern Missouri, 3) Residents of Southern 
Missouri, 4) Residents of St. Louis metro area, and 5) 
Residents of Kansas City metro area. Figure 1 shows the 
boundaries that were used in determining the four Missouri 
resident strata. These are based on the first three digits of the 
postal zip code. Proportional allocation was used to 
determine the number of sampled hunters in each stratum. 
As shown in Table 1, there were 110,691 total permit 
holders. A sample of 8,000 was proportionally allocated to 
each of the five strata. In Table 1 the column “% of Sample” 
refers to the percent of the overall sample that comes from 
that particular stratum (sample replied/total sample replied). 
The column “% of Strata Sampled” contains the percent of 
hunters who were sampled from that stratum (sample 
replied/total permits). The number of clusters in the 
population and the number of clusters in the sample for each 
stratum are known fixed numbers. The hunting trips taken 
by each hunter are the elements within the cluster. 
Throughout the remainder of the paper, population size is 
the number of hunting trips taken by all hunters in the 
frame, termed hunting pressure. For the MTHS example, 
the population size is unknown but not random; the sample 
size is best considered random since we do not know how 
many trips each hunter will take. Note that a hunter does not 
have to hunt in his/her design stratum. The hunter may hunt 
in many different counties during the season. For each study 
domain (i.e., county), the population size is again unknown 
but not random; the sample size is again random where we 
do not know in which county a hunter is going to hunt. 
Small sample sizes can still result; for example, Dunklin 
County, Pemiscot County, and New Madrid County in 
southeast Missouri had zero trips reported to them 
throughout the 3-week season in 1998 and Lawrence 
County in southwest Missouri had zero trips on the first day 
and only five the remainder of the first week. 


Table 1 Sample sizes and response rates from 5 strata for 1998 MTHS 


Design Stratum | Total Permits | Sample Sent | Sample Replied | Response Rate (%) | % of Sample | % of Strata Sampled 


Non- 1979S 1.600. 180 ees 735750220 5.96 
Resident 
North 2s 56 4t 532 O75 63.04 4 18:3 4.48 
Missouri 
South SA Spo aly eOUS ae623ath 928.4 4.39 
Missouri 


St. Louis 19,959 1,405 F094 68:94 Slos2 4.85 


Kansas 14,803 1,042 688 66.03 12.9 4.65 
City 
Total 110,691 8,000 5,321 66.51 100 4.81 
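The column definitions given for Table 1 can be checked against the Kansas City row and the Total row, whose entries are legible. The helper name `stratum_summary` is hypothetical; the inputs are the Kansas City permits (14,803), questionnaires sent (1,042) and replies (688), together with the total number of replies (5,321).

```python
def stratum_summary(permits, sent, replied, total_replied):
    """Percentages reported in Table 1 for one design stratum."""
    return {
        "response_rate": 100 * replied / sent,            # Response Rate (%)
        "pct_of_sample": 100 * replied / total_replied,   # % of Sample
        "pct_of_stratum_sampled": 100 * replied / permits,  # % of Strata Sampled
    }

kc = stratum_summary(permits=14803, sent=1042, replied=688, total_replied=5321)
```

Rounded as in the table, these give a response rate of 66.03%, 12.9% of the sample, and 4.65% of the stratum sampled.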


The MTHS is a post-season mail survey where the 
questionnaires are mailed to the hunter’s listed residence. 
The survey has questions specifying each day of the hunting 
season and asking hunters whether or not they hunted, in 
what county they hunted, and if their hunt was successful or 
not. In total 41% of the residents responded to the first 
mailing. Those who did not respond within two months 
were mailed the survey again, to which 26% responded 
from the second mailing. The survey was mailed a third 
time to those who had still not responded after two months 
with 20% of the third mailing responding. This resulted in 
an overall response rate of 66.5%. From Table | we see that 
non-residents of Missouri had the highest response rate at 
nearly 75%. The two metro areas of St. Louis and Kansas 
City then followed with response rates of 69% and 66%, 
respectively. The rural areas in northern Missouri and 
southern Missouri had the lowest response rates at 
approximately 63%. 

The 1998 spring turkey hunting season in Missouri 
consisted of the 21 consecutive days beginning on Monday, 
19 April. During the first week, a turkey hunter could 
harvest one bearded turkey. If successful during the first 
week, the hunter was allowed to harvest only one additional 
bearded turkey during the last two weeks of the season. If 
the hunter was unsuccessful during the first week, the hunter 
could harvest two bearded turkeys during the second two 
weeks, but only one bird per day could be taken. During this 
3-week season, the turkey biologist wanted information 
concerning hunter success for the first day due to the 
opening day effect in the hunting season when hunting 
pressure normally exceeds that of any other day of the 
season. The remaining six days of the first week are 
combined into a second time period of interest, called week 
1. Week 2 and week 3 are modeled separately to give a total 
of four time periods. 


Figure 1 Sampling domain boundaries for the four Missouri resident sampling domains

The small area of interest is the county-time period
combination for turkey hunters from each stratum. Turkey 
population management focuses on the county level as the 
smallest areal unit of interest due to hunters’ ability to 
indicate where they have hunted in reporting their success. 
A hunting license is required to hunt, thus we know how 
many total hunters there are in Missouri and in which 
stratum they live. Spreading the data out across 114 counties 
produces very sparse data within each of the counties. For 
this reason, rather than looking at success rates on a daily 
basis, we will use four time periods as defined above. In 
calculating the county-specific success rates, however, we 
use the number of hunting trips in each county. We do not 
know how many trips each hunter will take or in which 
county they will hunt. Thus, while we know the number of 
hunters in each stratum, the number of trips is unknown. 
Furthermore, while we know the number of hunters who 
have been sampled from each stratum, the number of trips 
to a specific county is unknown and random because a 
different sample of hunters will yield a different number of 
trips. There are two principal parameters to estimate and one
random variable to predict. The first parameter is the total 
number of hunting trips taken by all hunters, known as 
hunting pressure. This is important to wildlife managers 
who are concerned about the quality of the hunting 
experience. Too many hunters in an area have the tendency 
to interfere with each other’s hunt, which in turn lowers the 
quality of the hunting experience. The second parameter to 
estimate is the hunting success rate, or the proportion of 
turkeys harvested per hunting trip. If there are not enough 
turkeys in a county, wildlife managers will close the county 
for a number of years in order for the turkey population to 
recover. The random variable to predict is the total number 
of turkeys harvested. In Missouri, every turkey that is 
harvested must be reported. In 1998, turkey hunters were 
required to go to a check station where their turkeys were 
tallied. It was expensive to maintain these check stations. 
One of the purposes of this project is to predict the total number
of turkeys harvested at the county level and compare this to 
the actual number recorded at the check stations. 

Next we present a model to estimate hunting success rate 
and hunting pressure, as well as predict total turkey harvest 
simultaneously. The model accounts for random sample 
sizes that the previous models of He and Sun (1998), He and 
Sun (2000), and Oleson and He (2004) did not. 


2.2 The Model 


Let $p_{ijk}$ denote the success rate, $y_{ijk}$ the number of
successes, $n_{ijk}$ the number of trips in the sample, and $N_{ijk}$
the total number of trips, in county $i$, at time $j$, for
individuals from stratum $k$. We model turkey harvests with
independent binomial distributions

$$(y_{ijk} \mid n_{ijk}, p_{ijk}) \overset{ind}{\sim} \text{Binomial}(n_{ijk}, p_{ijk}), \qquad (1)$$

where $i = 1, \ldots, I$ is the county (i.e., study domain),
$j = 1, \ldots, J$ is the time period, and $k = 1, \ldots, K$ is the
design stratum. Here $I = 114$, $J = 4$, $K = 5$ for MTHS.
The previous analyses of the MTHS assumed a fixed
sample size, $n_{ijk}$, which we believe is best considered
random. We don’t know the number of trips made to each
county until after the survey has been collected. We also
don’t know the total number of hunting trips taken in the
small area, $N_{ijk}$, only the number of potential hunters. Since
hunters must stop after their second turkey, counties with
higher success rates would be expected to have fewer
hunting trips (or days) for the same number of harvested
birds than a county with lower success rates. If there is a
correlation, then the sample size must be considered
random. Also, in Bayesian hierarchical modeling, if the
distribution of a sample size is independent of the
distribution of the response variable, then the estimates are
identical for random and non-random (fixed) sample size $n_{ijk}$;
see Durbin (1969). The estimated success rate is smoother
for a fixed $n_{ijk}$ than when it is random (Woodard, He and
Sun 2003). Malec et al. (1997) applied Bayesian small
area estimation to the National Health Interview Survey.
There are two major differences between our model and that
of Malec et al. (1997). First, the population sizes, $N_{ijk}$, are
known in their model but unknown in our model. This is the
main reason that we introduce the bivariate GLMM.
Secondly, the logit of the success rate is modeled as a linear 
function of covariates in their model. Therefore, the 
estimates depend on the values of covariates but not the 
spatial locations. We will add a spatial component to the 
logit of success rates in addition to the covariates so that the 
estimates depend on both covariates and spatial locations. 
This is necessary if some important covariates are not 
available. 

To incorporate the randomness of the sample sizes, we
model $n_{ijk}$ with Poisson distributions

$$(n_{ijk} \mid N_{ijk}) \overset{ind}{\sim} \text{Poisson}(R_k N_{ijk}). \qquad (2)$$

The mean and variance of the Poisson distribution for $n_{ijk}$ are both
a constant multiplied by the population size, $N_{ijk}$. This
constant $R_k$ is the ratio of the number of hunters in the
sample for stratum $k$ to the total number of hunters from
stratum $k$. This ratio can be calculated from Table 1.
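As a rough illustration of how $R_k$ enters (2), the sketch below computes the ratio for the Kansas City stratum and for the overall sample from the Table 1 counts. It assumes that "hunters in the sample" means the responding hunters, which matches the "% of Strata Sampled" column; the function names and the trip count of 2,000 are hypothetical.

```python
# Counts read from Table 1 (only rows that are legible in this copy).
PERMITS = {"Kansas City": 14_803, "Total": 110_691}
REPLIED = {"Kansas City": 688, "Total": 5_321}


def sampling_ratio(stratum):
    """R_k: responding hunters divided by all permit holders (assumed definition)."""
    return REPLIED[stratum] / PERMITS[stratum]


def poisson_mean_and_variance(total_trips, r_k):
    """Under model (2), n | N ~ Poisson(R_k * N), so mean and variance coincide."""
    rate = r_k * total_trips
    return rate, rate


r_kc = sampling_ratio("Kansas City")
# Hypothetical county-period with N = 2,000 trips by Kansas City hunters.
mean_n, var_n = poisson_mean_and_variance(2_000, r_kc)
print(f"R_k = {r_kc:.4f}, expected sampled trips = {mean_n:.1f}")
```

Because the Poisson mean scales with $N_{ijk}$, a county that attracts few trips yields a small and highly variable sample, which is the sparsity the model is designed to handle.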

For the Poisson distribution, the overall sample size, $n_{\cdot\cdot\cdot}$,
is considered random. We presented in the previous section
why we consider this assumption to be appropriate. If $n_{\cdot\cdot\cdot}$
were fixed, then the multinomial distribution would be a
more appropriate model. The likelihoods of these two
approaches are very similar and yield comparable results in
either context (see Agresti 2002, pages 8-9).
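The link between the two sampling models can be made explicit with a standard identity, sketched here in generic notation: the cell index $c$ and the means $\mu_c$ are illustrative symbols, not the paper's. If the cell counts are independent Poisson variables, then conditioning on their total gives a multinomial distribution.

```latex
% Independent Poisson counts n_c ~ Poisson(mu_c), c = 1, ..., C, with total n = sum_c n_c.
\[
  \Pr\Bigl(n_1, \ldots, n_C \,\Bigm|\, \textstyle\sum_c n_c = n\Bigr)
  = \frac{\prod_{c} e^{-\mu_c}\mu_c^{n_c}/n_c!}
         {e^{-\sum_c \mu_c}\bigl(\sum_c \mu_c\bigr)^{n}/n!}
  = \frac{n!}{\prod_c n_c!}\,\prod_{c}\pi_c^{\,n_c},
  \qquad \pi_c = \frac{\mu_c}{\sum_{c'} \mu_{c'}} .
\]
```

The Poisson likelihood therefore factors into a multinomial term for the allocation of trips across cells and a Poisson term for the total, which is why the two approaches tend to give comparable results.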


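We model $p_{ijk}$ using its logit function and $N_{ijk}$ using a
logarithm transformation, i.e.,

$$\eta_{ijk} = \log\left(\frac{p_{ijk}}{1 - p_{ijk}}\right), \qquad \omega_{ijk} = \log(N_{ijk}).$$

Linear mixed models are used for the priors on both $\eta_{ijk}$
and $\omega_{ijk}$ by assuming

$$\eta_{ijk} = \theta_{1jk} + u_{1ik} + e_{1ijk},$$

$$\omega_{ijk} = \theta_{2jk} + u_{2ik} + e_{2ijk}.$$

Here for $a = 1, 2$, $\theta_{ajk}$ denotes the fixed effect due to the $j$th
time in stratum $k$, $u_{aik}$ represents a random county effect,
and random errors $e_{aijk}$ are iid $N(0, \delta^{(e)}_{ak})$.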

To complete the Bayesian hierarchical model, we need to
specify the priors for $\theta_{ak} = (\theta_{a1k}, \ldots, \theta_{aJk})^t$,
$u_{ak} = (u_{a1k}, \ldots, u_{aIk})^t$, and the variance components. He and Sun (2000) and Oleson and He
(2004) show that there is significant spatial correlation
among counties of Missouri in estimating the success rates.
They use a conditional auto-regressive (CAR) structure to
model spatial dependence between neighboring counties.
The joint density of $u_{ak}$ is given by

$$f(u_{ak}) = (2\pi\delta^{(u)}_{ak})^{-I/2}\,|I - \rho_{ak}C|^{1/2} \exp\left\{-\frac{1}{2\delta^{(u)}_{ak}}\, u_{ak}^t (I - \rho_{ak}C)\, u_{ak}\right\}. \qquad (3)$$

Here $\delta^{(e)}_{ak}$ and $\delta^{(u)}_{ak} > 0$ are variance components and
$I - \rho_{ak}C$ is a nonnegative definite symmetric matrix
(Besag 1974). $I$ is the $I \times I$ identity matrix and $C$ is an
adjacency matrix whose component $c_{ij}$ is 1 if areas $i$ and
$j$ share a common boundary, with $c_{ii} = 0$. We also
define $\rho_{1k}$ and $\rho_{2k}$ to be the spatial correlation parameters.


Let $\lambda_1 \le \ldots \le \lambda_I$ be the eigenvalues of the matrix $C$.
These matrices, $I - \rho_{ak}C$, are positive definite if
$\lambda_1^{-1} < \rho_{ak} < \lambda_I^{-1}$ (Clayton and Kaldor 1987). For the
Missouri data, $I = 114$ and the numerical values of $\lambda_1$ and
$\lambda_{114}$ are -2.8931 and 5.6938, respectively. This means that
the density of $u_{ak}$ exists if $\rho_{ak}$ is in (-0.3457, 0.1756).
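The admissible range for the spatial parameter can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it converts the extreme eigenvalues quoted above into the interval, and uses a hypothetical three-county adjacency matrix (counties in a row, whose extreme eigenvalues are $\pm\sqrt{2}$) to confirm positive definiteness through an attempted Cholesky factorisation.

```python
import math


def rho_interval(lam_min, lam_max):
    """Open interval (1/lam_min, 1/lam_max) on which I - rho*C is positive definite."""
    if not (lam_min < 0 < lam_max):
        raise ValueError("an adjacency matrix has lam_min < 0 < lam_max")
    return 1.0 / lam_min, 1.0 / lam_max


def car_matrix(C, rho):
    """Return I - rho*C for a square adjacency matrix given as nested lists."""
    size = len(C)
    return [[(1.0 if r == c else 0.0) - rho * C[r][c] for c in range(size)]
            for r in range(size)]


def is_positive_definite(A):
    """Attempt a Cholesky factorisation; it succeeds only if A is positive definite."""
    size = len(A)
    chol = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(r + 1):
            s = A[r][c] - sum(chol[r][m] * chol[c][m] for m in range(c))
            if r == c:
                if s <= 0.0:
                    return False
                chol[r][c] = math.sqrt(s)
            else:
                chol[r][c] = s / chol[c][c]
    return True


# Interval implied by the eigenvalues reported for the Missouri adjacency matrix.
missouri_interval = rho_interval(-2.8931, 5.6938)

# Hypothetical toy map: three counties in a row (1-2-3).
C_TOY = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_interval = rho_interval(-math.sqrt(2.0), math.sqrt(2.0))
print(missouri_interval, toy_interval)
```

In the toy map, a value such as 0.5 lies inside the interval and yields a positive definite matrix, while 0.8 lies outside and does not.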

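For the remaining priors, we assume the following. $\theta_{ajk}$
is normal with a fixed prior mean and variance. Let $\rho_{ak}$ be
uniformly distributed on the interval $(\lambda_1^{-1}, \lambda_I^{-1})$. Finally, a
common prior for the variance components $\delta^{(e)}_{ak}$ and $\delta^{(u)}_{ak}$ is
inverse gamma (Gelman, Carlin, Stern and Rubin 1995)
whose densities are proportional to

$$\frac{1}{(\delta^{(e)}_{ak})^{\alpha^{(e)}_{ak}+1}} \exp\left(-\frac{\beta^{(e)}_{ak}}{\delta^{(e)}_{ak}}\right) \quad \text{and} \quad \frac{1}{(\delta^{(u)}_{ak})^{\alpha^{(u)}_{ak}+1}} \exp\left(-\frac{\beta^{(u)}_{ak}}{\delta^{(u)}_{ak}}\right),$$

respectively.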

To evaluate the posterior distribution, we apply MCMC
methods such as Gibbs sampling (Gelfand and Smith 1990)
to obtain samples from the posterior distribution. For an
overview of MCMC methodologies see Gelman et al.
(1995), Gilks, Richardson and Spiegelhalter (1996) and
Robert and Casella (1999). The full conditional distributions
required for this evaluation may be found directly from
those given in Fact 1 in Section 3. Most of these conditional
distributions ($\theta_{ak}$, $u_{ak}$, $\delta^{(e)}_{ak}$, $\delta^{(u)}_{ak}$) are of standard forms and
are easily sampled from. Other full conditional distributions
($\eta_{ijk}$, $\omega_{ijk}$, and $\rho_{ak}$) have log-concave densities for the
MTHS. The adaptive rejection sampling method of Gilks
and Wild (1992) can be used to generate random samples
from log-concave densities.


2.3. Bayesian estimation and prediction 


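At this point, we have obtained estimates of $(p_{ijk}, N_{ijk})$.
We wish to pool the estimates from the strata together in
order to estimate $p_{ij}$ and $N_{ij}$. We will also predict the
unobserved number of harvests in county $i$, denoted $h_i$.

To obtain the estimates of $(p_{ij}, N_{ij})$, let $(p_{ijk}^{(l)}, N_{ijk}^{(l)})$,
$l = 1, \ldots, L$ be the output from the sampled Gibbs chain
after the burn-in sample. Define

$$p_{ij}^{(l)} = \frac{\sum_{k=1}^{K} p_{ijk}^{(l)} N_{ijk}^{(l)}}{\sum_{k=1}^{K} N_{ijk}^{(l)}}.$$

The posterior mean and variance of $p_{ij}$ can then be
approximated by

$$\hat{E}(p_{ij} \mid y) = \frac{1}{L} \sum_{l=1}^{L} p_{ij}^{(l)} \qquad (4)$$

and

$$\hat{V}(p_{ij} \mid y) = \text{the sample variance of } p_{ij}^{(1)}, \ldots, p_{ij}^{(L)} \text{ about } \hat{E}(p_{ij} \mid y), \qquad (5)$$

respectively. Similarly define $N_{ij}^{(l)} = \sum_{k=1}^{K} N_{ijk}^{(l)}$. The
posterior mean and variance of $N_{ij}$ can be approximated
from the MCMC simulation output as well.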
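The pooling step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the draws are stored as lists of per-stratum values (one inner list per MCMC iteration), the stratum weights are the sampled trip totals, and the variance uses a divisor of $L$, which is a choice made here rather than something the text specifies. All names and numbers are hypothetical.

```python
def pooled_success_rate(p_draws, n_total_draws):
    """For each MCMC draw, average stratum success rates weighted by stratum trips."""
    pooled = []
    for p_row, big_n_row in zip(p_draws, n_total_draws):
        numerator = sum(p * big_n for p, big_n in zip(p_row, big_n_row))
        pooled.append(numerator / sum(big_n_row))
    return pooled


def posterior_mean_and_variance(draws):
    """Monte Carlo mean and variance (divisor L, an assumption) of a list of draws."""
    count = len(draws)
    mean = sum(draws) / count
    variance = sum((d - mean) ** 2 for d in draws) / count
    return mean, variance


# Two hypothetical draws for one county-period with K = 2 strata.
p_draws = [[0.10, 0.30], [0.20, 0.20]]
n_total_draws = [[100.0, 300.0], [50.0, 150.0]]

pooled = pooled_success_rate(p_draws, n_total_draws)
mean_p, var_p = posterior_mean_and_variance(pooled)
print(pooled, mean_p, var_p)
```

Pooling inside each draw, rather than pooling the stratum-level posterior means afterwards, keeps the uncertainty in the trip totals in the final variance.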

We now focus on Bayesian prediction of the unobserved
harvest by their posterior predictive distributions. We have
that $y_{ijk}$ represents the number of harvested turkeys in
county $i$, time period $j$, and sampling stratum $k$ for
those in the sample, whereas for those not in the sample, we
let the number of harvested turkeys in county $i$, during time
$j$, stratum $k$ be represented as $\tilde{y}_{ijk}$. Thus, the number of
turkeys harvested in county $i$ at time $j$ for hunters from
stratum $k$ is $h_{ijk} = y_{ijk} + \tilde{y}_{ijk}$. Here $y_{ijk}$ is a known value
and we need only find $\tilde{y}_{ijk}$. We may think of $\tilde{y}_{ijk}$ given
$(n_{ijk}, N_{ijk}, p_{ijk})$ as a binomial random variable in the form
of (1). Thus

$$(\tilde{y}_{ijk} \mid n_{ijk}, N_{ijk}, p_{ijk}) \overset{ind}{\sim} \text{Binomial}(N_{ijk} - n_{ijk}, p_{ijk}). \qquad (6)$$

Let $(p_{ijk}^{(l)}, N_{ijk}^{(l)})$, $l = 1, \ldots, L$, be the output from running a
Gibbs chain after a burn-in sample. The predictive mean of
$\tilde{y}_{ijk}$ given data $d = \{(y_{ijk}, n_{ijk}): i = 1, \ldots, I,\ j = 1, \ldots, J,\ k = 1, \ldots, K\}$ is approximated by

$$\hat{E}(\tilde{y}_{ijk} \mid d) = \frac{1}{L} \sum_{l=1}^{L} (N_{ijk}^{(l)} - n_{ijk})\, p_{ijk}^{(l)}. \qquad (7)$$

Finally we write

$$\hat{h}_{ijk} = y_{ijk} + \hat{E}(\tilde{y}_{ijk} \mid d), \qquad (8)$$

with predictive variance approximated by

$$\hat{V}(h_{ijk} \mid d) = \frac{1}{L} \sum_{l=1}^{L} (N_{ijk}^{(l)} - n_{ijk})\, p_{ijk}^{(l)} (1 - p_{ijk}^{(l)}) + \frac{1}{L} \sum_{l=1}^{L} \left\{ (N_{ijk}^{(l)} - n_{ijk})\, p_{ijk}^{(l)} - \hat{E}(\tilde{y}_{ijk} \mid d) \right\}^2.$$

The proof is a straightforward application of conditional
expectations. Note that $h_{ijk}$ is defined as a random variable.
It does not equal $N_{ijk}\, p_{ijk}$ just as $y_{ijk}$ does not equal
$n_{ijk}\, p_{ijk}$. Therefore, one should not simply use $\hat{N}_{ijk}\, \hat{p}_{ijk}$ to
estimate $h_{ijk}$ although it might be very close if $N_{ijk}$ is large
enough.
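The prediction step can be sketched as follows, assuming posterior draws of the success rate and of the total number of trips are available for one county, period and stratum. The variance combines the average binomial variance of the unsampled trips with the spread of their conditional means across draws (the law of total variance); the function name and the toy numbers are hypothetical.

```python
def predict_harvest(y_observed, n_sampled, p_draws, n_total_draws):
    """Predicted harvest: observed kills plus the predictive mean for unsampled trips.

    Returns (predicted_harvest, predictive_variance).
    """
    count = len(p_draws)
    # Conditional mean and variance of the unsampled harvest, one value per draw.
    cond_means = [(big_n - n_sampled) * p for p, big_n in zip(p_draws, n_total_draws)]
    cond_vars = [(big_n - n_sampled) * p * (1.0 - p)
                 for p, big_n in zip(p_draws, n_total_draws)]
    pred_mean = sum(cond_means) / count
    within = sum(cond_vars) / count
    between = sum((m - pred_mean) ** 2 for m in cond_means) / count
    return y_observed + pred_mean, within + between


# Hypothetical cell: 3 kills observed on 20 sampled trips, two posterior draws.
harvest, variance = predict_harvest(
    y_observed=3, n_sampled=20, p_draws=[0.1, 0.2], n_total_draws=[120.0, 220.0]
)
print(harvest, variance)
```

The observed kills enter as a known constant, so they shift the prediction without adding to its variance.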


2.4 Model fitting 


Fifteen thousand iterations were run for the Gibbs chain,
of which 5,000 were discarded as a burn-in sequence. Thus, 
posterior estimates are based on 10,000 autocorrelated 
samples from the posterior distribution. In monitoring the 
convergence of Gibbs sampling, we have used the 
diagnostics of Heidelberger and Welch (1983) as well as 
graphical monitoring of the sample paths. Convergence 
diagnostics and posterior summaries were performed with 
the BOA software (Smith 2005). We let the variance components
$\delta^{(e)}_{ak}$ and $\delta^{(u)}_{ak}$ follow a non-informative IG(2,1) prior,
which gives a mean of one and infinite variance. We give
data-dependent priors $\theta_{1jk} \sim N(-1.5, 16)$ and $\theta_{2jk} \sim N(4, 25)$.
To obtain a data-dependent prior, we proceeded in
two steps, because a completely arbitrary data-dependent
prior might lead to unreliable posterior estimates
(Wasserman 2000). First, we began with noninformative
(but conjugate) priors to obtain the posterior means and
standard deviations for each of these parameters through
MCMC simulations. We then set data-dependent prior
means to be close to the posterior means of $\theta_{1jk}$ and $\theta_{2jk}$,
and the data-dependent prior standard deviations to be about
ten times the posterior standard deviations, after the 
posterior estimates were obtained using the noninformative 
priors. The estimates using these two approaches were quite 
similar, but the model using the data-dependent priors gave 
smaller variances. 
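The two-step prior construction can be sketched as below. The first-stage posterior summary used in the example (mean -1.5, standard deviation 0.4) is a hypothetical input chosen so that the result reproduces the N(-1.5, 16) prior quoted above; the text only says the prior mean is "close to" the posterior mean, so returning it unchanged is a simplification. The inverse gamma helper assumes the shape and scale parameterisation under which IG(2,1) has mean one.

```python
def data_dependent_prior(posterior_mean, posterior_sd, inflation=10.0):
    """Centre the prior at the first-stage posterior mean and inflate its spread.

    Returns (prior_mean, prior_variance).
    """
    prior_sd = inflation * posterior_sd
    return posterior_mean, prior_sd ** 2


def inverse_gamma_mean(shape, scale):
    """Mean of an inverse gamma distribution, defined only for shape > 1."""
    if shape <= 1.0:
        raise ValueError("the mean exists only for shape > 1")
    return scale / (shape - 1.0)


prior_mean, prior_variance = data_dependent_prior(-1.5, 0.4)  # hypothetical inputs
ig_mean = inverse_gamma_mean(2.0, 1.0)
print(prior_mean, prior_variance, ig_mean)
```

Inflating the standard deviation tenfold keeps the prior weak relative to the data while still anchoring it in a plausible region.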

We fit many simplified models for comparison and 
model checking. As a model selection tool we use the 
Deviance Information Criterion (DIC) suggested by 
Spiegelhalter, Best, Carlin and van der Linde (2002). DIC is 
a generalization of the Akaike Information Criterion that
measures the deviance over an MCMC run and includes a
penalized fit measure taking into account the model
dimension. Smaller values of DIC indicate a better-fitting
model. Our model with data-dependent priors had a
DIC = 13,910.5. No alternate models had a significantly
reduced DIC value from this model. The model with non-informative
priors on $\theta_{1jk}$ and $\theta_{2jk}$ had DIC = 14,700.4. A
reduced model with common correlation parameters
$\rho_{11} = \rho_{12} = \rho_{13} = \rho_{14} = \rho_{15}$ and $\rho_{21} = \rho_{22} = \rho_{23} = \rho_{24} = \rho_{25}$
had DIC = 13,896.9.
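For readers unfamiliar with DIC, the sketch below computes it for the binomial part of a model only, from posterior draws of cell-level success rates. It uses the usual definition, DIC = mean deviance + effective number of parameters, where the latter is the mean deviance minus the deviance at the posterior mean. Evaluating the plug-in deviance at the posterior mean of the success rates (rather than of the logits) is an assumption of this sketch, and the data are made up.

```python
import math


def binomial_log_pmf(y, n, p):
    """Log of the Binomial(n, p) probability mass at y."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(p) + (n - y) * math.log1p(-p)


def deviance(y, n, p):
    """Minus twice the binomial log-likelihood summed over cells."""
    return -2.0 * sum(binomial_log_pmf(yi, ni, pi) for yi, ni, pi in zip(y, n, p))


def dic(y, n, p_draws):
    """Return (DIC, effective number of parameters) from draws of cell-level rates."""
    count = len(p_draws)
    mean_deviance = sum(deviance(y, n, draw) for draw in p_draws) / count
    p_bar = [sum(draw[c] for draw in p_draws) / count for c in range(len(y))]
    p_eff = mean_deviance - deviance(y, n, p_bar)
    return mean_deviance + p_eff, p_eff


# Hypothetical data: two cells, three posterior draws of their success rates.
y_obs, n_obs = [5, 12], [40, 90]
draws = [[0.10, 0.12], [0.14, 0.15], [0.12, 0.13]]
dic_value, p_eff = dic(y_obs, n_obs, draws)
print(dic_value, p_eff)
```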

As another model check, we have calculated statewide 
averages of the estimates using both the simple naive 
design-based estimates and Bayesian estimates which 
follow in Table 2. At the statewide level, sample sizes are 
large enough to consider the design-based estimates reliable. 
The statewide design-based estimates and model estimates 
match closely. Thus, the Bayes estimator performs well in 
terms of the design consistency property (see You and Rao 
2003 and Jiang and Lahiri 2006). We note that the first day 
estimates are slightly lower due to smoothing, but the 
success rate estimates are still much higher for the first day 
than any other time period. 
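The simple design-based estimates in Table 2 are consistent with a plain ratio of reported kills to reported trips. The short check below reproduces three of the legible entries; the dictionary layout and names are illustrative only.

```python
# (kills, trips) pairs read from legible rows of Table 2.
TABLE2_COUNTS = {
    ("Non-Resident", "Day 1"): (105, 424),
    ("Overall", "Day 1"): (431, 2_235),
    ("Overall", "Total"): (2_409, 22_847),
}


def design_based_rate(kills, trips):
    """Naive design-based success rate: kills per reported trip."""
    return kills / trips


rates = {key: round(design_based_rate(*counts), 3)
         for key, counts in TABLE2_COUNTS.items()}
print(rates)
```

With zero trips in a cell this ratio is undefined, which is why some counties have no design-based estimate at the county level.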


Table 2 Success rate estimates for 1998 MTHS

Design Stratum  Period  Kills        Trips        Design-based  Bayesian
Non-Resident    Day 1   105          424          0.248         0.236
                Week 1  [illegible]  [illegible]  0.154         0.155
                Week 2  198          1,481        0.134         0.132
                Week 3  [illegible]  [illegible]  0.118         0.120
North           Day 1   82           452          0.181         0.171
                Week 1  [illegible]  [illegible]  0.138         0.137
                Week 2  161          1,663        0.097         0.097
                Week 3  [illegible]  [illegible]  0.088         0.090
South           Day 1   [illegible]  [illegible]  0.184         0.178
                Week 1  224          2,475        0.091         0.092
                Week 2  [illegible]  2,375        0.078         0.069
                Week 3  90           1,748        0.052         0.053
St. Louis       Day 1   54           346          0.156         0.149
                Week 1  [illegible]  [illegible]  0.071         0.073
                Week 2  [illegible]  [illegible]  0.068         0.069
                Week 3  45           828          0.054         0.059
Kansas City     Day 1   [illegible]  [illegible]  0.199         0.181
                Week 1  95           894          0.106         0.108
                Week 2  92           942          0.098         0.099
                Week 3  [illegible]  [illegible]  0.090         0.093
Overall         Day 1   431          2,235        0.193         0.181
                Week 1  892          7,943        0.112         0.111
                Week 2  722          7,718        0.094         0.088
                Week 3  364          4,951        0.074         0.075
                Total   2,409        22,847       0.105         0.102

2.5 Data analysis for MTHS


Posterior means and standard deviations for parameters 
under all design strata are listed in Table 3. We note that the
mean and variance estimates of $\theta_{12k}$ and $\theta_{13k}$ as well as
$\theta_{22k}$ and $\theta_{23k}$ for $k = 1, \ldots, 5$ are approximately equal.
This gives reason to believe that success rate estimates and 
hunting pressure estimates are similar for week 1 and 2 of 
the hunting season. There remains a difference in the first 
day and week 3 of the hunting season, though (Vangilder, 
Sheriff and Olsen 1990, Kimmel 2001). 


Table 3 Posterior means (standard deviations) of model parameters

Parameter          Non-Res   North        South        St. Louis    K.C.
[label illegible]  0.133     0.117        0.136        0.164        0.143
                   (0.034)   (0.030)      (0.036)      (0.051)      (0.040)
[label illegible]  0.096     0.044        0.041        0.061        0.060
                   (0.017)   (0.007)      (0.006)      (0.011)      (0.011)
[label illegible]  0.146     0.148        0.140        0.166        0.186
                   (0.040)   (0.041)      (0.038)      (0.053)      (0.059)
[label illegible]  0.570     0.511        0.633        0.373        0.356
                   (0.092)   (0.081)      (0.094)      (0.062)      (0.063)
$\theta_{11k}$     -1.209    -1.530       -1.244       -1.420       -1.472
                   (0.131)   (0.130)      (0.135)      (0.159)      (0.160)
$\theta_{12k}$     -1.613    -1.748       -1.867       -1.983       -1.928
                   (0.109)   (0.111)      (0.130)      (0.147)      (0.150)
$\theta_{13k}$     -1.801    -2.081       [illegible]  -2.032       -2.007
                   (0.113)   (0.113)      (0.129)      (0.148)      (0.148)
$\theta_{14k}$     -1.830    [illegible]  -2.354       [illegible]  -2.008
                   (0.132)   (0.123)      (0.138)      (0.156)      (0.153)
$\theta_{21k}$     3.368     3.333        3.297        3.305        3.308
                   (0.098)   (0.098)      (0.087)      (0.088)      (0.098)
$\theta_{22k}$     4.582     4.361        4.371        4.349        4.238
                   (0.097)   (0.093)      (0.084)      (0.081)      (0.090)
$\theta_{23k}$     4.404     4.429        4.459        4.293        4.229
                   (0.097)   (0.093)      (0.088)      (0.083)      (0.093)
$\theta_{24k}$     3.693     4.093        4.039        3.934        3.891
                   (0.101)   (0.095)      (0.086)      (0.082)      (0.094)
$\rho_{1k}$        0.153     0.103        0.163        0.158        0.125
                   (0.019)   (0.071)      (0.011)      (0.016)      (0.068)
$\rho_{2k}$        0.168     [illegible]  0.171        0.172        [illegible]
                   (0.007)   (0.003)      (0.004)      (0.004)      (0.004)


We also point out the correlation parameter estimates of
$\rho_{1k}$ and $\rho_{2k}$. The estimates for $\rho_{2k}$ for the hunting pressure
are all near their upper boundary, suggesting a strong
relationship between the counties when estimating the
number of hunting trips taken. Most of the estimates for $\rho_{1k}$
are more than two standard deviations away from zero as 
well. 

The simple naive design-based success rate estimates for 
each county are in the first row of Figure 2. Design-based 
estimates in the counties range from 0 to 0.55 with some
counties having no estimate because $n_{ijk} = 0$. Alternatively,
the Bayesian estimates for the success rate in county i for 
time j are plotted in the second row of Figure 2. The 
Bayesian model success rate estimates produce a much 
more sensible range from 0.03 to 0.30. We don’t expect any 
county to have a success rate estimate of 0. Also, turkeys are 
quite difficult to hunt and a high success rate is not sensible 
(He and Sun 1998, He and Sun 2000, Oleson and He 2004). 
Thus a lower value, such as that produced from the model, 
makes more intuitive sense. Standard deviations for the 
model success rate estimates are given in the third row of 
Figure 2. 

Success rate estimates tend to decrease over the course of 
the hunting season. In addition, the highest success rate 
estimates are in the northern portion of Missouri. This was 
also shown to be true in previous analyses of the MTHS 
using 1996 data (He and Sun 2000, Oleson and He 2004). 
We note that the highest success rate estimates are not in the 
same counties, though. The highest estimated rates from 
1998 have moved slightly to the east of where they were in 
1996. This shift is due to a time trend as noted by 
conservationists. 

We also produce estimates of the population size $N_{ij}$.
The model-based estimates of $N_{ij}$ are plotted in the first
row of Figure 3. The standard errors for the model-based 
estimates are plotted in the second row of Figure 3. The 
values plotted for weeks 1, 2, and 3 are the daily averages 
for those weeks. It is apparent that more people were 
hunting on the first day than any other day of the hunting 
season. 

We plot the actual check station data as the first map in 
Figure 4. Using formula (8) we predict the model-based 
total harvest $h_i$ in the second map of the first row. From
these figures, the model-based predictions look to be 
reasonably accurate. We expect to see a small amount of 
overprediction as is shown by comparing the plots. Some 
hunters may underestimate how often they went out to hunt 
and overestimate the number of birds they harvested. The 
overcount may possibly also be attributed to turkeys 
harvested but not reported at a check station. One more 
explanation could be that the hunters returning the survey 
are those who were more successful, and those not returning 
the survey were less successful or did not hunt. Comparing 
the predictions from the model to the actual harvest numbers 
is a confirmation that the model is appropriate and gives 
accurate predictions. Also, many states do not have check 
stations and this shows that they could obtain appropriate 
estimates at the domain level even with a small sample at 
the domain level.

Figure 2 Hunting success rates from stratified 1998 MTHS. Row 1 - design based estimates of
success rates. Row 2 - Bayesian estimates of success rates. Row 3 - Standard deviations
of Bayesian estimates of success rates

Figure 3 Estimated number of hunting trips from stratified 1998 MTHS. Row 1 - Bayesian estimated
number of trips. Row 2 - Standard deviation of Bayesian estimates of number of trips

[Figure 4 panels: 1998 Check Station Harvests; Bayesian Predicted Harvests]

3. General formulae for a bivariate GLMM


3.1 A general model 


We ignore the time component in the general model for
simplicity. Let $n_{ik}$ and $Y_{ik}$ be the sample sizes and response
variables of interest in study domain $i$ and design stratum
$k$, respectively. We assume that $\{Y_{ik}, n_{ik}\}$ given unknown
parameters $\{\eta_{ik}, \omega_{ik}, \phi_1, \phi_2\}$, with $i = 1, \ldots, I$, and
$k = 1, \ldots, K$, are independent. Assume that the conditional
density (mass) function of $Y_{ik}$ given $n_{ik}$ belongs to the
following family of probability densities,

$$g_1(y_{ik} \mid n_{ik}, \eta_{ik}, \phi_1) = \exp[A_1(\phi_1)\{y_{ik}\eta_{ik} - B_1(\eta_{ik}, n_{ik})\} + C_1(y_{ik}, n_{ik}, \phi_1)], \qquad (9)$$

where $\eta_{ik}$ is an unknown parameter, and it is often assumed
the scale parameter $\phi_1$ is known. The probability density
function of $n_{ik}$ is in the family,

$$g_2(n_{ik} \mid \omega_{ik}, \phi_2) = \exp[A_2(\phi_2)\{n_{ik}\omega_{ik} - B_2(\omega_{ik})\} + C_2(n_{ik}, \phi_2)], \qquad (10)$$

where $\omega_{ik}$ is an unknown parameter equal to a function of
the population size $N_{ik}$, and $\phi_2$ is often known. The joint
density of $Y_{ik}$ and $n_{ik}$, a bivariate GLMM, is then

$$p(y_{ik}, n_{ik} \mid \eta_{ik}, \omega_{ik}, \phi_1, \phi_2) = g_1(y_{ik} \mid n_{ik}, \eta_{ik}, \phi_1)\, g_2(n_{ik} \mid \omega_{ik}, \phi_2). \qquad (11)$$


The distribution family (10) is often called a generalized
linear model, which includes binomial, Poisson, normal and
gamma distributions. See, for example, Sun et al. (2000).
The distribution family (9) is a generalization of such a 
generalized linear model by including an additional 
parameter. Four special cases of (9) are the binomial, 
Poisson, normal and gamma distributions which are all a 
part of the exponential family. 
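As a worked illustration, the two MTHS distributions can be written in exponential-family form, with a success count $y$ out of $n$ trips, logit $\eta$ of the success rate, log population size $\omega = \log N$, and known sampling ratio $R$. The identification of the functions $A$, $B$ and $C$ below is a sketch derived here from the binomial and Poisson mass functions, not a formula quoted from the paper.

```latex
% Binomial(n, p) with eta = log{p/(1-p)}: scale function A_1 = 1.
\[
  \binom{n}{y} p^{y}(1-p)^{n-y}
  = \exp\Bigl[\, y\,\eta - \underbrace{n\log(1+e^{\eta})}_{B_1(\eta,\,n)}
      + \underbrace{\log\tbinom{n}{y}}_{C_1(y,\,n)} \Bigr].
\]
% Poisson(R N) with omega = log N: scale function A_2 = 1.
\[
  \frac{e^{-RN}(RN)^{n}}{n!}
  = \exp\Bigl[\, n\,\omega - \underbrace{R\,e^{\omega}}_{B_2(\omega)}
      + \underbrace{n\log R - \log n!}_{C_2(n)} \Bigr].
\]
```

The binomial case shows why the function $B_1$ must be allowed to depend on the sample size $n$: this is the additional argument that distinguishes the first family from the ordinary generalized linear model form.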

The bivariate GLMM is applicable when estimates at the
study domains are of interest and the sample size $n_{ik}$ is
considered random as well as being part of the observation.
The bivariate GLMM is also useful when estimates of $N_{ik}$
are required.

A linear mixed model can be used as a prior for $\eta_{ik}$ to
account for the variability in $\eta_{ik}$. However, one might be
interested in both $\eta_{ik}$ and $\omega_{ik}$ or a function of $\eta_{ik}$ and $\omega_{ik}$,
where $\omega_{ik}$ is often a function of the population size. Here
we would need to model $\eta_{ik}$ and $\omega_{ik}$ simultaneously. A
general class of GLMM for $\eta_{ik}$ and $\omega_{ik}$ could be

$$h_1(\eta_{ik}) = x_{1i}^t \theta_{1k} + s_{1i}^t u_{1k} + e_{1ik}, \qquad (12)$$

$$h_2(\omega_{ik}) = x_{2i}^t \theta_{2k} + s_{2i}^t u_{2k} + e_{2ik}. \qquad (13)$$

With $a = 1, 2$, $X_a = \{x_{ai}\}$ and $S_a = \{s_{ai}\}$ are known
design matrices. The vector $\theta_{ak}$ is the vector of fixed
effects, $u_{ak}$ is the vector of random effects, and $e_{aik}$ are
independent residual effects with $e_{aik} \sim N(0, \delta^{(e)}_{ak})$. In
addition, $u_{ak}$ and $e_{aik}$ are assumed mutually independent.

3.2 Additional priors 


To complete the Bayesian hierarchical model, we need to
specify the priors for $(\theta_{ak}, u_{ak}, \delta^{(e)}_{ak})$, $a = 1, 2$, $k = 1, \ldots, K$.
The common prior for fixed effects $\theta_{ak}$ is normal with a
large variance or a constant prior. Random effects are often
spatially correlated. The density may be of the CAR form
whose joint density is given by equation (3) with
$B_{ak} = I - \rho_{ak}C$. Finally, a common prior for the variance
components is an inverse gamma distribution.

To evaluate the posterior distribution, MCMC methods, 
such as Gibbs sampling, may be used to obtain samples 
from the posterior distribution. We give full conditional 
distributions in the general case below. 


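Fact 1 Let $(\Omega \mid \cdot)$ represent the conditional distribution of
$\Omega$ given all other parameters and $[\Omega \mid \cdot]$ represent the
conditional density. The conditional posterior densities of
$\eta_{ik}$ and $\omega_{ik}$ are as follows.

i) $[\eta_{ik} \mid \cdot] \propto \exp\left\{A_1(\phi_1)[y_{ik}\eta_{ik} - B_1(\eta_{ik}, n_{ik})] - \frac{1}{2\delta^{(e)}_{1k}}[h_1(\eta_{ik}) - x_{1i}^t\theta_{1k} - s_{1i}^t u_{1k}]^2\right\} |h_1'(\eta_{ik})|$, or equivalently
$v_{1ik} = h_1(\eta_{ik})$ has the conditional density

$$[v_{1ik} \mid \cdot] \propto \exp\left\{A_1(\phi_1)[y_{ik} h_1^{-1}(v_{1ik}) - B_1(h_1^{-1}(v_{1ik}), n_{ik})] - \frac{1}{2\delta^{(e)}_{1k}}(v_{1ik} - x_{1i}^t\theta_{1k} - s_{1i}^t u_{1k})^2\right\}.$$

ii) $[\omega_{ik} \mid \cdot] \propto \exp\left\{A_2(\phi_2)[n_{ik}\omega_{ik} - B_2(\omega_{ik})] - \frac{1}{2\delta^{(e)}_{2k}}[h_2(\omega_{ik}) - x_{2i}^t\theta_{2k} - s_{2i}^t u_{2k}]^2\right\} |h_2'(\omega_{ik})|$, or
equivalently $v_{2ik} = h_2(\omega_{ik})$ has the conditional
density

$$[v_{2ik} \mid \cdot] \propto \exp\left\{A_2(\phi_2)[n_{ik} h_2^{-1}(v_{2ik}) - B_2(h_2^{-1}(v_{2ik}))] - \frac{1}{2\delta^{(e)}_{2k}}(v_{2ik} - x_{2i}^t\theta_{2k} - s_{2i}^t u_{2k})^2\right\}.$$

For $a = 1, 2$, we have the following conditional
distributions.

iii) $(\theta_{ak} \mid \cdot)$ are mutually independent and have normal
conditional posterior distributions [mean and covariance expressions illegible in the source].

iv) $(u_{ak} \mid \cdot)$ are mutually independent and have
conditional posterior distributions

$$N_I\left(\left(\frac{1}{\delta^{(e)}_{ak}} S_a^t S_a + \frac{1}{\delta^{(u)}_{ak}} B_{ak}\right)^{-1} \frac{1}{\delta^{(e)}_{ak}} S_a^t (v_{ak} - X_a\theta_{ak}),\ \left(\frac{1}{\delta^{(e)}_{ak}} S_a^t S_a + \frac{1}{\delta^{(u)}_{ak}} B_{ak}\right)^{-1}\right).$$

v) $(\delta^{(e)}_{ak} \mid \cdot)$ have posterior distributions

$$\text{Inverse Gamma}\left(\alpha^{(e)}_{ak} + I/2,\ \beta^{(e)}_{ak} + \tfrac{1}{2}(v_{ak} - X_a\theta_{ak} - S_a u_{ak})^t (v_{ak} - X_a\theta_{ak} - S_a u_{ak})\right).$$

vi) $(\delta^{(u)}_{ak} \mid \cdot) \sim \text{Inverse Gamma}\left(\alpha^{(u)}_{ak} + I/2,\ \beta^{(u)}_{ak} + \tfrac{1}{2} u_{ak}^t B_{ak} u_{ak}\right)$.

vii) $[\rho_{ak} \mid \cdot] \propto |B_{ak}|^{1/2} \exp\left\{-\frac{1}{2\delta^{(u)}_{ak}} u_{ak}^t B_{ak} u_{ak}\right\}$ for $\rho_{ak}$ in $(\lambda_1^{-1}, \lambda_I^{-1})$.

viii) If $\phi_1$ has the prior density $p(\phi_1)$, then

$$[\phi_1 \mid \cdot] \propto p(\phi_1) \exp\left\{\sum_{i=1}^{I}\sum_{k=1}^{K}\left(A_1(\phi_1)[y_{ik} h_1^{-1}(v_{1ik}) - B_1(h_1^{-1}(v_{1ik}), n_{ik})] + C_1(y_{ik}, n_{ik}, \phi_1)\right)\right\}.$$

ix) If $\phi_2$ has the prior density $p(\phi_2)$, then

$$[\phi_2 \mid \cdot] \propto p(\phi_2) \exp\left\{\sum_{i=1}^{I}\sum_{k=1}^{K}\left(A_2(\phi_2)[n_{ik} h_2^{-1}(v_{2ik}) - B_2(h_2^{-1}(v_{2ik}))] + C_2(n_{ik}, \phi_2)\right)\right\}.$$

Examining parts i), ii), and vii) of Fact 1, the
conditional densities of $\eta_{ik}$, $\omega_{ik}$ and $\rho_{ak}$ are often log-concave.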


3.3 Estimating quantities of study domains 


It is often of interest to estimate certain quantities related
to the study domains. For instance, to estimate a quantity at
domain $i$, let

$$\psi_i = f_i(\eta_{i1}, \omega_{i1}, \ldots, \eta_{iK}, \omega_{iK}).$$

Bayesian estimates of $\psi_i$ can be easily computed based on
a random sample from the joint posterior. For example, let
$(\eta_{ik}^{(l)}, \omega_{ik}^{(l)})$, $l = 1, \ldots, L$ and $k = 1, \ldots, K$ be the output from
MCMC simulations and define

$$\psi_i^{(l)} = f_i(\eta_{i1}^{(l)}, \omega_{i1}^{(l)}, \ldots, \eta_{iK}^{(l)}, \omega_{iK}^{(l)}).$$

The posterior
mean and variance of $\psi_i$, the general forms of (4) and (5),
can be approximated by

$$\hat{E}(\psi_i \mid y) = \frac{1}{L} \sum_{l=1}^{L} \psi_i^{(l)}$$

and the sample variance of $\psi_i^{(1)}, \ldots, \psi_i^{(L)}$ about this mean,
respectively.


3.4 Predicting quantities of study domains 


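Now let $N_{ik}$ be the population size and $\tilde{Y}_{ik}$ the response
value of interest for those not in the sample from study
domain $i$ and design stratum $k$. We wish to predict the
quantities $\tilde{Y}_i = \sum_{k=1}^{K} \tilde{Y}_{ik}$, the total response value in study
domain $i$. For given $n_{ik}$, $\tilde{Y}_{ik}$ should be of the same family as
$Y_{ik}$ such that

$$g_3(\tilde{y}_{ik} \mid n_{ik}, N_{ik}, \eta_{ik}, \phi_1) = \exp[A_1(\phi_1)\{\tilde{y}_{ik}\eta_{ik} - B_1(\eta_{ik}, N_{ik} - n_{ik})\} + C_1(\tilde{y}_{ik}, N_{ik} - n_{ik}, \phi_1)].$$

This is simply (9) with $n_{ik}$ replaced by $N_{ik} - n_{ik}$. To
simplify notation, let $\xi$ denote the parameters in the model and
$d$ represent the data. (In our case here $d = \{(y_{ik}, n_{ik}):
i = 1, \ldots, I,\ k = 1, \ldots, K\}$ and $\xi$ includes the
parameters in modeling both $y_{ik}$ and $n_{ik}$.) Under the prior
$\pi(\xi)$ (either proper or improper), we have the posterior
density $[\xi \mid d] \propto f(d \mid \xi)\pi(\xi)$.

Assume that a further observation $\tilde{y}_{ik}$ follows
$g_3(\tilde{y}_{ik} \mid d, \xi)$ that may be dependent on $d$. The predictive
density of $\tilde{y}_{ik}$ given $d$ is written as

$$[\tilde{y}_{ik} \mid d] = \int g_3(\tilde{y}_{ik} \mid \xi, d)\,[\xi \mid d]\, d\xi.$$

Under the squared error loss function, the best predictor is
then the predictive mean given by

$$E(\tilde{y}_{ik} \mid d) = \int \left\{\int \tilde{y}_{ik}\, g_3(\tilde{y}_{ik} \mid \xi, d)\, d\tilde{y}_{ik}\right\} [\xi \mid d]\, d\xi.$$

Similarly, the best predictor of $h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK})$ given $d$ is
then

$$E\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid d\} = \int E\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid \xi, d\}\,[\xi \mid d]\, d\xi, \qquad (14)$$

where

$$E\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid \xi, d\} = \int \cdots \int h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \prod_{k=1}^{K} g_3(\tilde{y}_{ik} \mid \xi, d)\, d\tilde{y}_{i1} \cdots d\tilde{y}_{iK}. \qquad (15)$$

For the distribution family $g_3(\tilde{y}_{ik} \mid \xi, d)$, the right hand side
of (15) often has a closed form expression.

Bayesian predictions of (14) can be easily computed
based on a random sample from the joint posterior of $\xi$. For
example, let $\xi^{(l)}$, $l = 1, \ldots, L$, be the output from MCMC
simulations. The posterior predictive mean (14) can then be
approximated by

$$\hat{E}\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid d\} = \frac{1}{L} \sum_{l=1}^{L} E\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid \xi^{(l)}, d\}.$$

The posterior predictive variance of $h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK})$ given $d$
may be written as

$$V\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid d\} = E[V\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid \xi, d\} \mid d] + V[E\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid \xi, d\} \mid d].$$

This predictive variance can be approximated by

$$\hat{V}\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid d\} = \frac{1}{L} \sum_{l=1}^{L} V\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid \xi^{(l)}, d\} + \frac{1}{L} \sum_{l=1}^{L} \left[E\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid \xi^{(l)}, d\} - \hat{E}\{h(\tilde{y}_{i1}, \ldots, \tilde{y}_{iK}) \mid d\}\right]^2.$$
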

4. Comments 


In this article, we developed a bivariate GLMM 
containing two unknown canonical parameters. This was 
necessary to obtain estimates when the sample size was 
random and estimates of $N_{ik}$ were required in addition to
estimates of $\eta_{ik}$. The model was built using two
simultaneous GLMMs in a Bayesian hierarchical structure.
The proposed model has the advantage of being applicable
to a wide array of problems. Introducing a random sample 
size and estimating the population sizes are useful 
techniques for many applications. 

Naturally, we think that there is an inverse relationship 
between hunting success rates and the number of trips taken 
to a county. In modeling this, we may think of a two-fold 
CAR model to view the relationship between $\eta_{ik}$ and $N_{ik}$,
such as that of Kim, Sun and Tsutakawa (2001). A 
multivariate CAR model may be another approach to 
address this situation. 

For each of the spatial models, we have assumed a 
common correlation, $\rho_{ak}$, across the entire state. It may be
more reasonable to include additional correlation terms in 
different productivity regions defined by the Missouri 
Department of Conservation. This would be an interesting 
and complicated addition to the model. In addition, the 
spatial structure used in this paper is similar to that of He 
and Sun (2000) and Oleson and He (2004) where this spatial 
modeling worked well. 

It may be useful to include the distance from hunter’s 
home to the hunting location to help estimate hunting 
pressure. Most hunters stay close to home when hunting and 
this information could be incorporated into the hierarchical 
framework. 

Note that the estimated harvest is higher than the check 
station harvest. This is partly because more successful 
hunters tend to reply to the mail survey. We are conducting 
research adjusting for the nonresponse bias.

Small area estimation of average household income based on
unit level models for panel data 


Enrico Fabrizi, Maria Rosaria Ferrante and Silvia Pacei


Abstract 


The European Community Household Panel (ECHP) is a panel survey covering a wide range of topics regarding economic, 
social and living conditions. In particular, it makes it possible to calculate disposable equivalized household income, which 
is a key variable in the study of economic inequity and poverty. To obtain reliable estimates of the average of this variable 
for regions within countries it is necessary to have recourse to small area estimation methods. In this paper, we focus on 
empirical best linear predictors of the average equivalized income based on “unit level models” borrowing strength across 
both areas and times. Using a simulation study based on ECHP data, we compare the suggested estimators with cross- 
sectional model-based and design-based estimators. In the case of these empirical predictors, we also compare three 
different MSE estimators. Results show that those estimators connected to models that take units’ autocorrelation into 
account lead to a significant gain in efficiency, even when there are no covariates available whose population mean is known.


Key Words: European Community Household Panel; Average equivalized income; Linear mixed models; Empirical best linear unbiased predictor; MSE estimation.


1. Introduction 


In recent years, the academic world has taken an 
increasing interest in the analysis of regional economic 
disparities that represent a serious challenge to the 
promotion of national economic growth, and thus to social 
cohesion. This is particularly true within the European 
Union, where regional disparities are a distinguishing 
feature of the economic landscape. This renewed interest in 
local economies has produced a growing demand for 
regional statistical information and has stimulated research 
on income distribution, poverty and social exclusion at the 
sub-national level. 

In the 1990s, Eurostat (the EU’s Statistics Bureau) 
launched the European Community Household Panel 
(ECHP), an annual panel survey of European households 
conducted using standardised methods throughout the EU’s 
various member countries (Betti and Verma 2002; Eurostat 
2002). The ECHP terminated in 2001, after eight waves. 
Currently, it is being replaced by the Survey on Income and 
Living Conditions in the Community (EU-SILC), which 
resembles the ECHP in many ways, but for which no data 
has yet been published. The ECHP panel survey covered a 
wide range of topics and, in particular, it made it possible to 
calculate disposable equivalized household income, which 
constitutes a key variable in the study of economic equity 
and poverty. 

The ECHP was designed to provide reliable estimates for 
large areas within countries called NUTS1 (NUTS stands
for the “Nomenclature of Territorial Units for Statistics”,
which is defined according to certain principles described
on the EUROSTAT web site http://europa.eu.int/comm/eurostat/ramon/nuts/home_regions_en.html). Unfortunately,
NUTS1 correspond to areas (five groups of Administrative
Regions in the Italian case) that are too large to effectively 
measure local area income disparity or to provide useful 
information for the purposes of regional governance. 
Therefore, to obtain estimates for a finer geographic detail, a 
small area estimation method has to be used and the 
problem is to select an appropriate and effective method. 

In this paper, in order to combine information from past 
surveys, related auxiliary variables and small areas, we 
consider several possible extensions of the well-known unit 
level nested error regression model (see Battese, Harter and 
Fuller 1988) for the estimation of the average of household 
equivalized income. Using ECHP panel survey data, we 
illustrate how such models could be potentially useful in
improving the efficiency of small area estimates by 
exploiting the correlation of individual household incomes 
over time. 

In section 2, we present a general set-up for small area 
estimation using panel survey data and briefly review both 
design-based and model-based small area estimation
methods. In this section, we develop empirical best linear 
unbiased predictors (EBLUP) and their mean squared error 
(MSE) estimators for selected unit level cross-sectional and 
time series models using the available theory on EBLUP for 
small area estimation (see Rao 2003, and Jiang and Lahiri 
2006a, for details). We note that cross-sectional and time 
series models were considered in the small area literature, 


1. Enrico Fabrizi, DMSIA, University of Bergamo, via dei Caniana 2, 24127, Bergamo, Italy. E-mail: enrico.fabrizi@unibg.it; M.R. Ferrante, Department 
but only in the context of area level modelling (see Rao and
Yu 1994; Ghosh, Nangia and Kim 1996; Datta, Lahiri, 
Maiti and Lu 1999; Datta, Lahiri and Maiti 2002; 
Pfeffermann 2002; among others). 

In section 3, we briefly review the ECHP survey and
describe how we use this survey data to conduct a Monte 
Carlo simulation study to compare different small area 
estimators and their MSE estimators. In section 4, we report 
results from the Monte Carlo simulation experiment. We 
note that the simulation experiment is aimed at evaluating 
design-based properties of all estimators, even if they are 
derived as model based predictors. We observed that the 
EBLUPs perform very well compared to the design-based 
estimators even though our pseudo-population exhibits signs 
of non-normality. The non-normality of the pseudo-population,
however, seems to affect the efficiency of the MSE
estimators. In our simulation, the Taylor series (see Prasad
and Rao 1990; Datta and Lahiri 1999, among others) and
the parametric bootstrap (see Butar and Lahiri 2003) MSE
estimators are found to be more sensitive to non-normality
than the jackknife method of Jiang, Lahiri and
Wan (2002). We end the paper with a few concluding 
remarks. 


2. The small area estimation methods considered 


To describe sample data, let $y_{dti}$ denote the value of a
study variable for the $i$-th unit belonging to the $d$-th small
area at time $t$ ($d = 1, \ldots, m$; $t = 1, \ldots, T$; $i = 1, \ldots, n_{dt}$). Moreover,
let $x_{dti}$ be the vector of covariates’ values associated
with each $y_{dti}$ (and whose first element is equal to 1), and
let $X = \{x_{dti}'\}$ be the $n \times p$ matrix of covariates’ values for
the whole sample ($n = \sum_{d,t} n_{dt}$). Let us suppose that we are
interested in predicting small area means for the target
variable at final time $T$: $\bar{Y}_{dT}$ ($d = 1, \ldots, m$). Let us also
suppose that the vectors of mean population values of
covariates are known for time $T$; we denote these vectors by
$\bar{X}_{dT}$ ($d = 1, \ldots, m$).


2.1 Design-based estimators 


A first solution to the small area estimation problem is to 
use direct estimators, that is, estimators employing only y 
values obtained from the area (and time) which the 
parameter refers to. The simplest of direct estimators of the 
population mean is the weighted mean. We denote this 
direct estimator as $\bar{y}_{dT,\mathrm{DIR}}$ ($d = 1, \ldots, m$) and we will be
using it as a benchmark in the following sections. 

Synthetic estimators may be generally defined as 
unbiased estimators for a larger area with acceptable 
standard errors. They are used to calculate estimates for 
small areas, under the hypothesis that small areas have the
same characteristics as larger ones. Moreover, when information
about auxiliary variables is available, a particular
synthetic estimator, the regression estimator, may be 
obtained by fitting a regression model to all sample data. 
Note that the synthetic estimator is area specific with respect 
to the auxiliary variables but not with respect to the study 
variable. 

For instance, if we consider only those observations from 
the last wave (t=T), the simple regression model would 
be given by: 


$$y_{dTi} = x_{dTi}'\beta + e_{dTi},$$

$$E(e_{dTi}) = 0, \qquad E(e_{dTi}^2) = \sigma^2.$$


To take account of the complexity of the sampling
design, the weighted least squares estimate $\hat{\beta}_w$ of $\beta$ may be
obtained, and thus the synthetic regression estimator will be
given by:


$$\bar{y}_{dT,\mathrm{RSYN}} = \bar{X}_{dT}'\hat{\beta}_w, \qquad d = 1, \ldots, m. \qquad (1)$$


Synthetic estimators usually display very low variances, 
but they may be severely biased whenever the model 
holding for the whole sample does not properly fit area- 
specific data. Composite estimators are weighted averages 
of a direct and a synthetic estimator. We consider the 
composite estimator: 


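$$\bar{y}_{dT,\mathrm{COMP}} = \phi_{dT}\,\bar{y}_{dT,\mathrm{DIR}} + (1 - \phi_{dT})\,\bar{y}_{dT,\mathrm{RSYN}}, \qquad (2)$$


where

$$\phi_{dT} = \frac{\mathrm{MSE}_D(\bar{y}_{dT,\mathrm{RSYN}})}{\mathrm{MSE}_D(\bar{y}_{dT,\mathrm{DIR}}) + \mathrm{MSE}_D(\bar{y}_{dT,\mathrm{RSYN}})}$$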

and $\mathrm{MSE}_D$ signifies that the mean square error is evaluated
in relation to the randomization distribution. This choice of
$\phi_{dT}$ leads to composite estimators $\bar{y}_{dT,\mathrm{COMP}}$ that are
approximately optimal in terms of $\mathrm{MSE}_D$ (see Rao 2003,
section 4.3). In practice, the quantities in the formula for the
$\phi_{dT}$’s are unknown and may be estimated from the data.
Unbiased and consistent estimators can be obtained for
$\mathrm{MSE}_D(\bar{y}_{dT,\mathrm{DIR}}) = V_D(\bar{y}_{dT,\mathrm{DIR}})$ using standard formulas.
An approximately design unbiased estimator of
$\mathrm{MSE}_D(\bar{y}_{dT,\mathrm{RSYN}})$ can be obtained using the formulas
discussed in Rao (2003, section 4.2.4). In particular, we
calculate the approximation:


$$\mathrm{mse}_D(\bar{y}_{dT,\mathrm{RSYN}}) \approx (\bar{y}_{dT,\mathrm{RSYN}} - \bar{y}_{dT,\mathrm{DIR}})^2 - v_D(\bar{y}_{dT,\mathrm{DIR}}),$$


where $\mathrm{mse}_D$ and $v_D$ stand for the estimators of the
corresponding $\mathrm{MSE}_D$ and $V_D$. In particular, $v_D$ is the
ordinary design unbiased estimator of $V_D$. We then take its
average over $d$, as usual, in order to obtain a more stable
estimator. In fact, one problem with $\mathrm{mse}_D$ is that it can
even be negative.
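
The composite weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `composite_estimates` is hypothetical, the design variances of the direct estimates are taken as given and strictly positive, and the averaged synthetic MSE is floored at zero, a choice that the text does not prescribe.

```python
import numpy as np

def composite_estimates(y_dir, y_syn, v_dir):
    """Combine direct and synthetic area estimates as in formula (2).

    y_dir, y_syn: direct and synthetic estimates, one entry per area.
    v_dir: estimated design variances of the direct estimates (assumed > 0).
    """
    y_dir = np.asarray(y_dir, dtype=float)
    y_syn = np.asarray(y_syn, dtype=float)
    v_dir = np.asarray(v_dir, dtype=float)
    # Area-level approximation of the synthetic MSE; it can be negative.
    mse_syn_area = (y_syn - y_dir) ** 2 - v_dir
    # Average over areas for stability; flooring at zero is an assumption
    # of this sketch, not something stated in the text.
    mse_syn = max(float(mse_syn_area.mean()), 0.0)
    # Weight given to the direct estimator in each area.
    phi = mse_syn / (v_dir + mse_syn)
    return phi * y_dir + (1.0 - phi) * y_syn, phi

# Toy input: two areas with made-up estimates and variances.
comp, phi = composite_estimates([10.0, 20.0], [12.0, 17.0], [1.0, 4.0])
print(phi, comp)
```

Areas with a noisy direct estimate (large variance) receive a smaller weight on that estimate and lean more on the synthetic one.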

Moreover, a modified direct estimator borrowing
strength over areas for estimating the regression coefficient
can be used to improve estimator reliability. If auxiliary
information is available, the generalized regression estimator
(GREG),

$$\bar{y}_{dT,\mathrm{GREG}} = \bar{X}_{dT}'\hat{\beta}_w + \Big(\sum_{j \in s_{dT}} w_j\Big)^{-1} \sum_{j \in s_{dT}} w_j e_j, \qquad (3)$$

approximately corrects the bias of the synthetic estimator by
means of the term $(\sum_{j \in s_{dT}} w_j)^{-1} \sum_{j \in s_{dT}} w_j e_j$, based on
regression residuals $e_j$.
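
As a rough illustration of the synthetic and GREG estimators, the sketch below fits a weighted least squares regression and then adds the weighted mean of the area's residuals to the synthetic prediction. The helper names `wls_beta` and `greg_mean` and the toy numbers are assumptions made for this example only.

```python
import numpy as np

def wls_beta(X, y, w):
    """Weighted least squares coefficients using survey weights w."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w  # scales the column for observation j by w_j
    return np.linalg.solve(XtW @ X, XtW @ y)

def greg_mean(x_bar, beta_w, X_s, y_s, w_s):
    """Synthetic prediction plus weighted mean of the area's residuals."""
    x_bar = np.asarray(x_bar, dtype=float)
    X_s = np.asarray(X_s, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    w_s = np.asarray(w_s, dtype=float)
    resid = y_s - X_s @ beta_w
    return x_bar @ beta_w + np.sum(w_s * resid) / np.sum(w_s)

# Toy sample from one area: intercept plus one covariate.
X_s = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_s = np.array([1.0, 3.0, 5.0])
w_s = np.array([1.0, 1.0, 2.0])
beta_w = wls_beta(X_s, y_s, w_s)
print(greg_mean([1.0, 1.5], beta_w, X_s, y_s, w_s))
```

In practice the regression would be fitted on the whole sample and the residual correction computed area by area; here a single area is used to keep the example short.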


2.2 Model-based estimators 


The model-based estimators we have considered are 
based on the specification of explicit models for sample data 
which approximate a hypothetical data-generating process. 
As a consequence, the problem of estimating $\bar{Y}_{dT}$ comes
down to one of prediction. Moreover, mean square errors
and other statistical properties of estimators are usually
evaluated with respect to the data-generating process. We
have focused here on “unit level” models, that is, models
relating $y_{dti}$ to a vector of covariates $x_{dti}$. The use of
explicit models has several advantages, the most important
of which is the opportunity to test the underlying
assumptions.

In the estimation of the small area means or totals of 
continuous variables, linear mixed models are very often 
used. A general linear mixed model can be described as 
follows: 


$$y = X\beta + Z_1 v_1 + \cdots + Z_s v_s + e, \qquad (4)$$


where $y = \{y_{dti}\}$ is the $n$-vector of sample observations, $\beta$
a $p \times 1$ vector of fixed effects, $v_j$ is a $q_j \times 1$ vector of
random effects ($j = 1, \ldots, s$), $e = \{e_{dti}\}$ a vector of errors; $X$
is assumed of rank $p$, $Z_j = \{z_{jdti}'\}$ is an $n \times q_j$ matrix of
incidence of the $j$-th random effect. We assume that
$E(v_j) = 0$, $V(v_j) = G_j$, $E(e) = 0$, $V(e) = R$ (all expectations
are with respect to model (4)) and that $v_1, \ldots, v_s, e$ are mutually
independent.

As a consequence, the variance-covariance matrix of $y$ is
given by:


$$V = V(y) = \sum_{j=1}^{s} Z_j G_j Z_j' + R = ZGZ' + R,$$


where $Z = [Z_1 | \ldots | Z_s]$ and $G$ is the block-diagonal matrix
with blocks $G_1, \ldots, G_s$. It is usually assumed that matrices
$G$, $R$ depend on a $k$-vector of variance components $\psi$, and
so we can write $V(\psi) = ZG(\psi)Z' + R(\psi)$.

Note that at the level of individual observations, the
model (4) can be rewritten as $y_{dti} = x_{dti}'\beta + z_{1dti}'v_1 + \cdots + z_{sdti}'v_s + e_{dti}$.

We consider different specifications for linear mixed
models, all of which can be viewed as special cases of the
general model (4). For the sake of simplicity, we have
adopted a unit level notation when describing the models
considered. The first model:


$$\mathrm{MM1}: \quad y_{dti} = x_{dti}'\beta + v_d + \alpha_t + e_{dti}, \qquad (5)$$


may be obtained from formula (4) setting $s = 2$, $q_1 = m$,
$q_2 = T$, $G_1 = \sigma_v^2 I_m$, $G_2 = \sigma_\alpha^2 I_T$, $R = \sigma_e^2 I_n$. It includes independent
area and time effects, and therefore area effects are
assumed not to evolve over time. This random effects
structure corresponds to the assumption of a constant
covariance between units that belong to the same area,
observed at two different points in time.


The second model: 


$$\mathrm{MM2}: \quad y_{dti} = x_{dti}'\beta + \delta_{dt} + e_{dti}, \qquad (6)$$


corresponds to the particular case in which $s = 1$,
$q_1 = mT$, $G_1 = \sigma_\delta^2 I_{mT}$, $R = \sigma_e^2 I_n$. The effects of interaction
between area and time are introduced, that is, we assume
there are area effects which are not constant over time.

The third model: 


$$\mathrm{MM3}: \quad y_{dti} = x_{dti}'\beta + v_d + \alpha_t + e_{dti}, \qquad (7)$$


is obtained setting $s = 2$, $q_1 = m$, $q_2 = T$, $G_1 = \sigma_v^2 I_m$, $R = \sigma_e^2 I_n$,
while the generic element $g_2(h,k)$ of $G_2$ is $g_2(h,k) = \sigma_\alpha^2 \rho^{|h-k|}$,
$h, k = 1, \ldots, T$. There are independent area and
time effects, just as in MM1, but the time effects are
assumed to follow an AR(1) process.


The fourth model:


$$\mathrm{MM4}: \quad y_{dti} = x_{dti}'\beta + \delta_{dt} + e_{dti}, \qquad (8)$$


is similar to model MM2 in that it is characterized by time
varying area effects, but the further assumption that such
effects follow an AR(1) process is also introduced. Thus,
provided we order observations by area, with respect to the
general formula (4) we have $s = 1$, $q_1 = mT$, $G_1 = \mathrm{diag}(G_{1d})$,
$R = \sigma_e^2 I_n$, where $G_{1d}$, $d = 1, \ldots, m$, is a $T \times T$
matrix with generic element $g_{1d}(h,k) = \sigma_\delta^2 \rho^{|h-k|}$, $h, k = 1, \ldots, T$.

The last specification:


$$\mathrm{MM5}: \quad y_{dti} = x_{dti}'\beta + v_d + \alpha_t + e_{dti}, \qquad (9)$$

may be obtained from (4) setting $s = 2$, $q_1 = m$, $q_2 = T$,
$G_1 = \sigma_v^2 I_m$, $G_2 = \sigma_\alpha^2 I_T$. Provided we order observations by
household and time, $R = \mathrm{diag}(R_{di})$, where $R_{di}$ is a $T \times T$
matrix whose generic element is given by $r_{di}(h,k) = \sigma_e^2 \rho^{|h-k|}$,
$h, k = 1, \ldots, T$. There are independent area and
time effects as in MM1, but errors are assumed to be
autocorrelated according to an AR(1) process.
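
The AR(1) covariance blocks used in MM3, MM4 and MM5 all share one pattern, a variance multiplied by the autocorrelation parameter raised to the absolute lag. A minimal sketch, assuming that parametrisation and using the hypothetical helper name `ar1_cov`, shows how one such block and a block-diagonal matrix of the MM4 type could be built:

```python
import numpy as np

def ar1_cov(T, sigma2, rho):
    """T x T matrix with generic element sigma2 * rho**|h - k|."""
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return sigma2 * rho ** lags

# MM4-style covariance of the area-by-time effects: one AR(1) block per
# area, with the random effects ordered by area (toy sizes m = 2, T = 3).
m, T = 2, 3
G1 = np.kron(np.eye(m), ar1_cov(T, sigma2=2.0, rho=0.5))
print(G1.shape)
```

The same helper gives $G_2$ in MM3 (a single $T \times T$ block) and each household block of $R$ in MM5.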

In order to evaluate the impact that using past survey
waves has on the efficiency of the estimators, a cross-sectional
linear mixed model (SMM) using data from the last wave $T$
only has been taken as the benchmark:


$$\mathrm{SMM}: \quad y_{dTi} = x_{dTi}'\beta + \vartheta_d + e_{dTi}, \qquad (10)$$

with $\vartheta_d \overset{\mathrm{IID}}{\sim} N(0, \sigma_\vartheta^2)$, $e_{dTi} \overset{\mathrm{IID}}{\sim} N(0, \sigma_e^2)$.

This is also a particular case of (4), obtained for $s = 1$,
$q_1 = m$, $G_1 = \sigma_\vartheta^2 I_m$ and $R = \sigma_e^2 I_n$. Note that (10) is the
standard nested error regression model of Battese et al.
(1988).

We also consider the corresponding random error 
variance linear models (see Rao 2003; section 5.5.2) 
obtained by replacing $x_{dti}'\beta$ in formulas (5) - (10) with a
general intercept $\theta$. These models will be denoted as
MM1*, MM2*, MM3*, MM4*, MM5*, SMM*. All the 
assumptions made regarding random effects and residuals 
remain unchanged. This latter group of models enables us to 
explore the gains in efficiency obtained by exploiting the 
repetition of the observation on the same unit when no 
covariates are available at the population level. 

In small area estimation, the aim is to predict scalar linear 
combinations of fixed and random effects of the type 
$\eta = m'\beta + k'v$, where $m$ and $k$ are $p \times 1$ and $q \times 1$ vectors
respectively, with $q = \sum_j q_j$. The best linear unbiased
predictor (BLUP) of $\eta$ can be obtained by estimating
$(\beta, v)$ minimizing the model MSE among all linear
estimators:


$$\hat{\eta}^{\mathrm{BLUP}}(\psi) = m'\hat{\beta}(\psi) + k'\hat{v}(\psi). \qquad (11)$$


When the variance components in $\psi$ are unknown, they
may be estimated from the data and substituted into formula
(11), thus obtaining the “empirical BLUP”
$\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi}) = m'\hat{\beta}(\hat{\psi}) + k'\hat{v}(\hat{\psi})$ (see Rao 2003, chapter 6, and Jiang and
Lahiri 2006b for details).
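
For given variance components, the BLUP in (11) reduces to generalized least squares for the fixed effects followed by a shrinkage step for the random effects. The sketch below illustrates this on a toy nested error model with two areas; the function name `blup` and the data are assumptions of the example, and the variance components are treated as known rather than estimated by REML.

```python
import numpy as np

def blup(y, X, Z, G, R, m_vec, k_vec):
    """BLUP of eta = m'beta + k'v for known covariance matrices G and R."""
    V = Z @ G @ Z.T + R
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    # Generalized least squares estimate of the fixed effects.
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
    # Predicted random effects: G Z' V^{-1} (y - X beta).
    v = G @ Z.T @ np.linalg.solve(V, y - X @ beta)
    return m_vec @ beta + k_vec @ v, beta, v

# Toy nested error model: 2 areas, 2 units each, intercept only.
y = np.array([1.0, 1.0, 3.0, 3.0])
X = np.ones((4, 1))
Z = np.kron(np.eye(2), np.ones((2, 1)))  # area incidence matrix
G = 1.0 * np.eye(2)                      # area effect variance
R = 2.0 * np.eye(4)                      # unit error variance
eta, beta, v = blup(y, X, Z, G, R, np.array([1.0]), np.array([1.0, 0.0]))
print(eta, beta, v)
```

Replacing `G` and `R` by their estimated counterparts gives the empirical version of the predictor.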

As far as the estimation of $\psi$ is concerned, a number of
methods have been proposed in the literature, such as 
Maximum Likelihood (ML) and Restricted Maximum 
Likelihood (REML) which assume the normality of random 
terms, and the MINQUE proposed by Rao (1972) which is 
non-parametric. In the present work we have opted for the 
REML method, thus assuming normality. 


2.3. Measures of uncertainty associated with 
predictors based on linear mixed models 


The difficult problem of estimating the MSE of EBLUP 
estimators, taking the variability of the estimated variance 
and covariance components into account, has been faced in 
the small area literature by adopting diverse approaches. 

One popular method is based on the Taylor series 
approximation of MSE under normality (Prasad and Rao 
1990; Datta and Lahiri 1999). More recently, due to the
advent of high-speed computers and powerful software,
resampling methods have been proposed. For instance,
Butar and Lahiri (2003) introduce a parametric bootstrap
method based on the assumption of normality, but analytically
less onerous than the Taylor series method. Jiang et al.
(2002) discuss a general jackknife method, which requires a 
distributional assumption weaker than normality (posterior 
linearity). We aim to empirically compare the performance 
of these three estimators within a context where the number 
of areas is moderate and the assumption of normality may 
not hold perfectly true. The following is a short description 
of the three estimation approaches. 

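Let us define $\mathrm{MSE}[\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi})] = E[\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi}) - \eta]^2$,
where expectation refers to model (4). It is possible to show
that, under normality,


$$\mathrm{MSE}[\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi})] = g_1(\psi) + g_2(\psi) + E[\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi}) - \hat{\eta}^{\mathrm{BLUP}}(\psi)]^2, \qquad (12)$$

where $g_1(\psi) = k'(G - GZ'V^{-1}ZG)k$ and $g_2(\psi) = d'(X'V^{-1}X)^{-1}d$,
with $d' = m' - k'GZ'V^{-1}X$ (see Rao 2003,
chapter 3). Using the following approximation, based on a
Taylor series argument,


$$E[\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi}) - \hat{\eta}^{\mathrm{BLUP}}(\psi)]^2 \approx \mathrm{tr}[(\partial b'/\partial\psi)\, V\, (\partial b'/\partial\psi)'\, \bar{V}(\hat{\psi})] = g_3(\psi),$$


where $b' = k'GZ'V^{-1}$ and $\bar{V}(\hat{\psi})$ denotes the asymptotic
covariance matrix of $\hat{\psi}$, a second order approximation to
(12) can be found:


$$\mathrm{MSE}[\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi})] \approx g_1(\psi) + g_2(\psi) + g_3(\psi). \qquad (13)$$


Note that here $\approx$ means that the omitted terms are of
order $o(m^{-1})$. An asymptotically unbiased estimator of
(13), based on Prasad and Rao (1990), is given by


$$\mathrm{mse}_{PR}(\hat{\eta}^{\mathrm{EBLUP}}) = g_1(\hat{\psi}) + g_2(\hat{\psi}) + 2g_3(\hat{\psi}). \qquad (14)$$


Datta and Lahiri (1999) show that, under normality and
REML or ML estimation of $\psi$, $\mathrm{mse}_{PR}(\hat{\eta}^{\mathrm{EBLUP}})$ estimates
$\mathrm{MSE}[\hat{\eta}^{\mathrm{EBLUP}}(\hat{\psi})]$ with a bias of order $o(m^{-1})$.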
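
The first two terms of the MSE approximation have closed forms that are easy to evaluate numerically. The following sketch computes $g_1$ and $g_2$ for a toy nested error model with known variance components; the function name `g1_g2` and the data are assumptions of the example, and the third term $g_3$, which requires derivatives with respect to the variance components, is deliberately left out.

```python
import numpy as np

def g1_g2(X, Z, G, R, m_vec, k_vec):
    """Leading MSE terms g1 and g2 of the BLUP of eta = m'beta + k'v."""
    V = Z @ G @ Z.T + R
    Vinv_Z = np.linalg.solve(V, Z)
    Vinv_X = np.linalg.solve(V, X)
    # g1 = k'(G - G Z' V^{-1} Z G) k: uncertainty about the random effects.
    g1 = k_vec @ (G - G @ Z.T @ Vinv_Z @ G) @ k_vec
    # g2 = d'(X' V^{-1} X)^{-1} d with d' = m' - k' G Z' V^{-1} X:
    # uncertainty due to estimating the fixed effects.
    d = m_vec - k_vec @ G @ Z.T @ Vinv_X
    g2 = d @ np.linalg.solve(X.T @ Vinv_X, d)
    return g1, g2

# Toy nested error model: 2 areas, 2 units each, intercept only.
X = np.ones((4, 1))
Z = np.kron(np.eye(2), np.ones((2, 1)))
G = 1.0 * np.eye(2)
R = 2.0 * np.eye(4)
g1, g2 = g1_g2(X, Z, G, R, np.array([1.0]), np.array([1.0, 0.0]))
print(g1, g2)
```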

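Butar and Lahiri (2003) propose a parametric bootstrap
estimation of (13) under the assumption of normality. We
adapt their estimator to the models we are analysing,
assuming the following bootstrap model:


$$\begin{aligned}
\text{i)}\quad & y^* \mid v^* \overset{\mathrm{ind}}{\sim} N[X\hat{\beta} + Zv^*, R(\hat{\psi})] \\
\text{ii)}\quad & v^* \overset{\mathrm{ind}}{\sim} N[0, G(\hat{\psi})]
\end{aligned} \qquad (15)$$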

where $v^* = (v_1^{*\prime}, \ldots, v_s^{*\prime})'$. The parametric bootstrap is then used
twice, once to estimate the first two terms of (13), thus
correcting the bias of $g_1(\hat{\psi}) + g_2(\hat{\psi})$, and once to estimate
$g_3(\psi)$.

The following estimator of (13) is proposed:


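$$\mathrm{mse}_{BL}(\hat{\eta}^{\mathrm{EBLUP}}) = 2[g_1(\hat{\psi}) + g_2(\hat{\psi})] - E_*[g_1(\hat{\psi}^*) + g_2(\hat{\psi}^*)] + E_*[\hat{\eta}(y, \hat{\beta}(\hat{\psi}^*), \hat{\psi}^*) - \hat{\eta}(y, \hat{\beta}(\hat{\psi}), \hat{\psi})]^2, \qquad (16)$$


where $\hat{\psi}^*$ is the same as $\hat{\psi}$ except that it is calculated on
$y^*$ instead of $y$, and $E_*$ is the expected value with regard
to the bootstrap model (15).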

The bootstrap estimator (16) does not require the 
analytical derivation of $g_3(\psi)$, which can be rather
laborious when G and R have complicated structures. 
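
The sampling step of the parametric bootstrap can be sketched as follows: random effects and errors are drawn from normal distributions with the estimated covariance matrices, and a bootstrap response vector is assembled from them. The function name `bootstrap_sample` and the toy inputs are assumptions; re-estimating the variance components on each bootstrap sample (for example by REML) is not shown.

```python
import numpy as np

def bootstrap_sample(X, Z, beta_hat, G_hat, R_hat, rng):
    """Draw one bootstrap data set from a normal linear mixed model."""
    # Random effects v* ~ N(0, G_hat).
    v_star = rng.multivariate_normal(np.zeros(G_hat.shape[0]), G_hat)
    # Errors e* ~ N(0, R_hat), so that y* | v* ~ N(X beta_hat + Z v*, R_hat).
    e_star = rng.multivariate_normal(np.zeros(R_hat.shape[0]), R_hat)
    y_star = X @ beta_hat + Z @ v_star + e_star
    return y_star, v_star

# Toy design: 2 areas, 2 units each, intercept only.
X = np.ones((4, 1))
Z = np.kron(np.eye(2), np.ones((2, 1)))
beta_hat = np.array([2.0])
G_hat = 1.0 * np.eye(2)
R_hat = 2.0 * np.eye(4)
y_star, v_star = bootstrap_sample(X, Z, beta_hat, G_hat, R_hat,
                                  np.random.default_rng(0))
print(y_star.shape, v_star.shape)
```

Averaging the quantities in (16) over many such draws approximates the bootstrap expectation $E_*$.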

Jiang et al. (2002) introduced a general jackknife estimator
for the variance of empirical best predictors in linear
and non-linear mixed models with M-estimation. In the
problem we are investigating here, the estimator they
propose can be written as:


$$\mathrm{mse}_{JLW}(\hat{\eta}^{\mathrm{EBLUP}}) = g_1(\hat{\psi}) - \frac{m-1}{m} \sum_{j=1}^{m} [g_1(\hat{\psi}_{-j}) - g_1(\hat{\psi})] + \frac{m-1}{m} \sum_{j=1}^{m} (\hat{\eta}_{-j}^{\mathrm{EBLUP}} - \hat{\eta}^{\mathrm{EBLUP}})^2, \qquad (17)$$


where $\hat{\psi}_{-j}$ is the estimate of $\psi$ calculated by using all data
except those from the $j$-th area. Similarly,
$\hat{\eta}_{-j}^{\mathrm{EBLUP}} = \hat{\eta}(y, \hat{\beta}(\hat{\psi}_{-j}), \hat{\psi}_{-j})$.

It is worth pointing out that, on the basis of the
simulation results reported in Jiang et al. (2002), $\mathrm{mse}_{JLW}$ is
deemed to be more robust than $\mathrm{mse}_{PR}$ with regard to
departures from the assumption of normality, which can
also be expected to be crucial for $\mathrm{mse}_{BL}$.
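
Once the delete-one-area quantities are available, the jackknife estimator is a short arithmetic combination of them. The sketch below assumes that the leading term $g_1$ and the predictor have already been recomputed with each area left out in turn; the function name `jackknife_mse` and the numbers are purely illustrative.

```python
import numpy as np

def jackknife_mse(g1_full, g1_del, eta_full, eta_del):
    """Jackknife MSE estimate from full-sample and delete-one-area values.

    g1_del[j] and eta_del[j] are the values of g1 and of the predictor
    obtained when the variance components are re-estimated without area j.
    """
    g1_del = np.asarray(g1_del, dtype=float)
    eta_del = np.asarray(eta_del, dtype=float)
    m = g1_del.size
    # Bias correction of the leading term g1.
    bias_corr = (m - 1) / m * np.sum(g1_del - g1_full)
    # Variability of the predictor due to estimating the variance components.
    var_term = (m - 1) / m * np.sum((eta_del - eta_full) ** 2)
    return g1_full - bias_corr + var_term

# Made-up values for m = 4 areas.
mse_jk = jackknife_mse(1.0, [1.1, 0.9, 1.0, 1.2], 10.0, [10.2, 9.8, 10.0, 10.4])
print(mse_jk)
```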


3. The simulation study based on the European
Community Household Panel data


The target population of the ECHP survey consists of all 
resident households of a large subset of the EU member 
countries. Although general survey guidelines were issued 
by Eurostat, a certain degree of flexibility was allowed, so 
there are some differences in the sampling design across 
countries. As far as Italy is concerned, the survey is based 
on a stratified two stage design, in which strata were formed 
by grouping the PSUs (municipalities) according to geographic
region (NUTS2) and demographic size. For more
details of the survey, see Eurostat (2002). 

The ECHP deals with unit non-response, sample attrition 
and new entries using weighting and imputation. As attrition 
could lead to biased estimates of income if it does not
occur at random, the effect of poverty on dropout
propensity has been investigated (Rendtel, Behr and Sisto
2003; Vandecasteele and Debels 2004), and the results of
these studies show that in the case of some countries,
including Italy, this effect disappears once the weighting
variables are controlled for.

We have focused our attention on the eight ECHP waves 
available for Italy (1994-2001). Given that our aim is to 
assess whether the use of several successive observations of 
the same household could be profitable for the purposes of 
small area estimation, we have set aside the problem of
attrition and only considered those households that participated
in the survey in all waves.

Our target variable is disposable, post-tax household 
income at the time of the last wave (2001). In studies of 
poverty and inequality, income is often equivalized 
according to an equivalence scale in order to avoid comparison
problems caused by differences in the composition of
households. We consider the widely-used modified OECD 
scale, also adopted by Eurostat (2002) in its publications on 
income, poverty and social exclusion. According to this 
scale, equivalized income is calculated by dividing 
disposable household income by the number ($k$) of
“equivalent adults”, defined as $k = 1 + 0.5a + 0.3c$, where $a$
is the number of adults other than the “head of the
household” and c is the number of children aged 13 or less. 
In general, the equivalized income can be perceived as the 
amount of income that an individual, living alone, should 
dispose of in order to attain the same level of economic 
wellbeing he/she enjoys in his/her household. 
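
The equivalence scale is simple enough to check with a small worked example. The function name `equivalized_income` and the household figures below are hypothetical; the weights are those given in the text.

```python
def equivalized_income(income, other_adults, children):
    """Household income divided by the number of equivalent adults.

    k = 1 + 0.5 * (adults other than the head) + 0.3 * (children aged 13 or less)
    """
    k = 1.0 + 0.5 * other_adults + 0.3 * children
    return income / k

# A household with 42,000 Euros, one adult besides the head and two young
# children counts as k = 2.1 equivalent adults.
value = equivalized_income(42000.0, other_adults=1, children=2)
print(value)
```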

Of the many covariates available in the bountiful ECHP 
questionnaire, we have chosen only those for which area 
means were available from the 2001 Italian Census results. 
Thus the chosen covariates are: the percentage of adults; the 
percentage of employed; the percentage of unemployed; the 
percentage of people with a high/medium/low level of 
education in the household; household typology (presence 
of children, presence of aged people, etc.); the number of
rooms per capita and the tenure status of the accommodation
(rented, owned, etc.).

As we have said, the aim of this paper is to compare the 
performance of different estimators in the controlled 
environment of a simulation exercise. A number of works in 
the literature have compared small area estimators using 
Monte Carlo experiments in which samples are drawn from 
synthetic populations based either on Censuses (Falorsi, 
Falorsi and Russo 1994; Ghosh et al. 1996) or on the
replication of sample units’ records (Falorsi, Falorsi and
Russo 1999; Lehtonen, Särndal and Veijanen 2003; Singh,
Mantel and Thomas 1994). Since household income is not
measured by the Italian Census (nor is it given by the results
of other Censuses conducted by EU countries), we treated
the ECHP survey data as the pseudo-population and then
drew samples using stratified probability proportional to
size sampling, the size variable being given by survey
weights. This solution may not be as good as that of using 
data from a real Census population, but it is hopefully more 
realistic than generating population values of household 
income from a parametric model. 

Monte Carlo samples of 1,000 (roughly 15% of the 
actual ECHP sample size) were drawn from the synthetic 
population by stratified random sampling without replacement,
with strata given by the 21 NUTS2 regions. Thus
these regions are treated as planned domains (as in the 
ECHP) for which sample size in the small areas is 
established beforehand, so that the sampling fractions reflect 
the over-sampling of smaller regions exactly as they do in 
the actual ECHP sampling design. The region-specific 
sample sizes we obtained range from 14 to 112, being on 
average equal to 48. Therefore in our simulation $n$ = 1,000,
the number of small areas corresponds to that of the Italian 
Regions (m=21) and the number of points in time 
corresponds to the ECHP available waves (T = 8). 

The distribution of equivalized household income in our 
pseudo-population (that is the distribution obtained by 
weighted estimation from the ECHP sample data) is 
characterized by an overall mean of 22,547 Euros and a 
coefficient of variation of 0.59. The distribution is positively
skewed (even though skewness is not extreme: skewness
coefficient $\gamma_1 = \mu_3/\sigma^3 = 2.5$) and heavy-tailed (kurtosis
coefficient $\kappa = \mu_4/\sigma^4 = 14.3$). The difference between mean and median is
9% of the mean. An interesting feature is given by the large 
disparities among administrative regions (that are the small 
areas of interest in our study). The mean of the equivalized 
household income ranges from 16,604 to 27,011, that is the 
most affluent area has a mean equivalent income 62% 
higher than the poorest one. The coefficient of variation
(ranging from 0.28 to 0.84), skewness ($\gamma_1$ ranging from 0.1
to 4.6) and kurtosis ($\kappa$ ranging from -0.7 to 32.9) also show that
the distribution of our target variable differs considerably
across areas.

To motivate the selected specifications of the random
effects part of the considered linear mixed models (see 
section 2.2), an approach often recommended in textbooks 
(see Verbeke and Molenberghs 2000, chapter 9) has been 
followed: first we fit a standard OLS regression to our data 
using all available covariates; then we analyse the resulting 
residuals as a guide to identifying the random effects. This 
preliminary analysis has been conducted separately on 
several random samples of size 1,000, drawn up according 
to the replication design described above. 

The adjusted $R^2$ of the OLS regression is close to 0.35
in every observed sample. This rather low figure is the result 
of the nature of the phenomenon under study (household 
income is not easy to predict), the information contained in 
the survey and the constraint represented by the need to 
include only those covariates for which the population total 
can be obtained from the Census. 

Figure 1 contains “box and whiskers” plots of the
residuals by area and wave constructed for one of the Monte
Carlo samples (very similar findings may be observed in
every sample). Analysis of the plots suggests that there is
within-area and within-wave correlation, and thus the need 
to specify models including area and wave effects. From an 
analysis of residuals, it is less clear whether the inclusion of 
interaction effects (that is time varying area effects) would 
be beneficial or not. 

Moreover the residuals show a degree of autocorrelation, 
the average of the autocorrelation coefficient calculated over 
all individual residual histories being 0.27. Even though this 
autocorrelation level is not very high, for the sake of 
completeness we decided to also take into consideration 
models with autocorrelated errors or random effects. After 
having tested various autocorrelation structures
(ARMA(p, q), General Linear, etc.), we found that the autoregressive
process of order 1 provides the best fit to our
data. 
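
One way to obtain an average autocorrelation of the kind mentioned above is to compute a lag-1 autocorrelation coefficient for each household's residual history and then average across households. The text does not state which estimator was used, so the formula below is an assumption, as are the function name `mean_lag1_autocorr` and the toy residuals.

```python
import numpy as np

def mean_lag1_autocorr(resid):
    """Average lag-1 autocorrelation over rows of a (households x waves) array.

    Each row is centred on its own mean; rows with constant residuals
    would give a zero denominator and are assumed not to occur.
    """
    resid = np.asarray(resid, dtype=float)
    r = resid - resid.mean(axis=1, keepdims=True)
    num = np.sum(r[:, 1:] * r[:, :-1], axis=1)
    den = np.sum(r ** 2, axis=1)
    return float(np.mean(num / den))

# Two made-up residual histories over four waves.
rho_bar = mean_lag1_autocorr([[1.0, -1.0, 1.0, -1.0],
                              [1.0, 2.0, 3.0, 4.0]])
print(rho_bar)
```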


Figure 1 Box and whiskers plot of residuals by wave (left) and area (right)

The apparent skewness of residuals also suggests that the
normality assumption for errors does not hold exactly. We 
maintain this assumption for all the models we specify, and 
we use REML estimators for variance components. In fact, 
we may expect departures from normality to have a slight 
impact on point values of predictors. BLUP formulas can be 
derived without normality; moreover, there are sound 
reasons for us to expect REML (and ML) estimators of $\psi$
to perform well even if normality does not hold (see Jiang,
1996, for details). Departures from normality may have
a more serious impact on MSE estimation, and this is a
problem we are going to be looking at in section 4.2 below. 


4. Results 


4.1 Point estimators 


All computations involved in the simulation exercise 
described in section 3 were carried out using SAS version 
9.1 for Windows. EBLUP estimators are obtained using 
Proc MIXED, and the generation of samples is based on 
Proc SURVEYSELECT. 

Given that the primary goal of Small Area Estimation is 
the precise estimation of area-specific parameters, we first 
evaluated how well the described estimators perform when 
predicting individual area values. Moreover, we also 
evaluated the amount of over-shrinkage connected with 
each estimator. In fact, small area estimates should reflect 
(at least approximately) the variability in the underlying area 
parameters taken as a whole. 

We note that our simulation experiment is aimed at 
evaluating design-based properties of the estimators, that is, 
the population from which the random samples are 
generated is held fixed. 

For the evaluation of the estimators’ performance, we 
adopted an approach that is commonly found in the 
literature (see Rao 2003; section 7.2.6), using two 
indicators, the Average Absolute Relative Bias (AARB) and 
the Average Relative Mean Square Error (ARMSE): 

$$\mathrm{AARB} = m^{-1} \sum_{d=1}^{m} \left| R^{-1} \sum_{r=1}^{R} \left( \frac{\bar{y}_{dT}(r)}{\bar{Y}_{dT}} - 1 \right) \right|,$$

$$\mathrm{ARMSE} = m^{-1} \sum_{d=1}^{m} R^{-1} \sum_{r=1}^{R} \left( \frac{\bar{y}_{dT}(r)}{\bar{Y}_{dT}} - 1 \right)^2, \qquad (18)$$


where $\bar{y}_{dT}(r)$ is the estimate for area $d$, time $T$ and replicated
sample $r$, while $\bar{Y}_{dT}$ is the population mean being estimated.
Note that AARB measures the bias of an estimator, whereas
ARMSE measures its accuracy. The number of replications
$R$ is set at 500, a figure large enough to obtain stable Monte
Carlo estimates of expected values and variances, frequently
used in simulation studies on small area estimation (Heady, 
Higgins and Ralphs 2004; EURAREA Consortium 2004). 
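
Both indicators are averages of relative errors over replications and areas, which makes them straightforward to compute from an array of simulated estimates. The sketch below assumes the estimates are stored with one row per replication and one column per area; the function name `aarb_armse` and the numbers are illustrative only.

```python
import numpy as np

def aarb_armse(est, truth):
    """AARB and ARMSE from simulated estimates.

    est: array of shape (R, m), one row per replicated sample.
    truth: array of shape (m,), the population means being estimated.
    """
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rel = est / truth - 1.0
    # Absolute value of the mean relative error per area, averaged over areas.
    aarb = float(np.mean(np.abs(rel.mean(axis=0))))
    # Mean squared relative error per area, averaged over areas.
    armse = float(np.mean((rel ** 2).mean(axis=0)))
    return aarb, armse

# Two replications, two areas, made-up values.
aarb, armse = aarb_armse([[11.0, 18.0], [9.0, 22.0]], [10.0, 20.0])
print(aarb, armse)
```

In this toy case the errors cancel within each area, so the bias indicator is zero while the accuracy indicator is not.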

The gain in efficiency connected to each small area 
estimator is evaluated using the ratio of its ARMSE to the 
ARMSE of certain estimators we use as benchmarks. In 
particular, all estimators are compared with the weighted 
mean $\bar{y}_{dT,\mathrm{DIR}}$ and we denote this ratio as $\mathrm{AEFF}_{\mathrm{DIR}}$.
Moreover, EBLUP estimators associated with models (5) -
(9), which use data from previous waves, are compared with
the EBLUP estimator associated with the cross-sectional
model (10), in order to assess the gain in efficiency deriving
from the use of past waves. In this case the ratio is denoted
$\mathrm{AEFF}_{\mathrm{SMM}}$.

As far as the evaluation of the degree of shrinkage is 
concerned, we have compared the empirical standard 
deviation of population area values: 


$$\mathrm{ESD} = \sqrt{m^{-1} \sum_{d=1}^{m} (\bar{Y}_{dT} - \bar{Y}_T)^2},$$


where $\bar{Y}_T$ is the mean of the population values of the $m$
areas at time $T$, with the empirical standard deviation of the
estimated area values, which in the case of a simulation
study is given by:


$$\mathrm{esd} = R^{-1} \sum_{r=1}^{R} \sqrt{m^{-1} \sum_{d=1}^{m} (\bar{y}_{dT}(r) - \bar{y}_T(r))^2},$$


where $\bar{y}_T(r)$ is the mean of the estimated values for the $m$
areas at time $T$ in the simulation run $r$. The comparison is
carried out using the indicator


$$\mathrm{RESD} = \frac{\mathrm{esd}}{\mathrm{ESD}} - 1, \qquad (19)$$


which tells us how the empirical standard deviation 
associated with one estimator differs from that of the 
population. 
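
The shrinkage comparison can be sketched in the same style. The code computes the population spread ESD, the average spread of the estimates esd, and their relative difference; reading the indicator as esd/ESD minus one is an assumption of this sketch, as are the function name `shrinkage_indicator` and the toy values.

```python
import numpy as np

def shrinkage_indicator(est, truth):
    """Return (esd, ESD, esd / ESD - 1) for simulated area estimates.

    est: array of shape (R, m), one row per replicated sample.
    truth: array of shape (m,), the population area means.
    """
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Spread of the true area means around their average.
    ESD = np.sqrt(np.mean((truth - truth.mean()) ** 2))
    # Spread of the estimates within each replication, averaged over runs.
    centred = est - est.mean(axis=1, keepdims=True)
    esd = np.mean(np.sqrt(np.mean(centred ** 2, axis=1)))
    # Relative difference; the "- 1" is an assumed reading of the indicator.
    return float(esd), float(ESD), float(esd / ESD - 1.0)

esd, ESD, resd = shrinkage_indicator([[12.0, 18.0], [11.0, 19.0]], [10.0, 20.0])
print(esd, ESD, resd)
```

Under this reading, a negative value indicates that the estimates vary less across areas than the true means do, that is, over-shrinkage.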

Table 1 contains the percentage values of AARB, 
ARMSE, AEFF and RESD obtained for the direct esti- 
mator, the design-based estimators given in (2) and (3) and 
the EBLUP estimators derived from models (5) - (10). 

All estimators perform significantly better than $\bar{y}_{dT,\mathrm{DIR}}$
in terms of ARMSE, leading to $\mathrm{AEFF}_{\mathrm{DIR}}$ values below 100%.
We can also see that design-based estimators are
worse than EBLUP estimators in terms of ARMSE, and that
the gain in efficiency demonstrated by $\mathrm{AEFF}_{\mathrm{DIR}}$ is
particularly high in some cases (in excess of 50%). This 
result highlights the superior accuracy of the model-based 
estimators in question.

Table 1 Performance indicators - auxiliary information is available

Model   AARB%         ARMSE%
DIR     0.0           0.787
COMP    (illegible)   0.552
GREG    0.2           0.543
SMM     (illegible)   0.377
MM1     (illegible)   0.358
MM2     2.4           0.427
MM3     2.6           0.380
MM4     2.6           0.429
MM5     (illegible)   0.318

The most reliable EBLUP estimator is the one associated 
with the MM5 model, with independent area and time 
effects and residuals autocorrelated according to an AR(1) 
process, leading to a gain in efficiency of about 60% 
compared with the direct estimator. This is followed by the 
EBLUP estimator associated with model MM1, which 
differs from the previous one only because of the absence of 
autocorrelated residuals. 

In terms of bias, the GREG estimator gives the smallest 
value of AARB, as would be expected (Särndal, Swensson
and Wretman 1992, chapter 7; Veijanen, Lehtonen and
Särndal 2005). This is followed by the remaining estimators,
all of which reveal a similar value for AARB. Of the
EBLUP estimators, those associated with the MM1 and
MM5 models are more efficient in terms of ARMSE, but
they are slightly more biased than the one associated with 
the SMM. This is probably due to the fact that we limit our 
evaluation of performance to the last wave; for this data 
subset we would expect the fit of the regression underlying 
SMM, based on the last wave only, to be better than the one 
based on the whole data set. As far as EBLUP estimators are 
concerned, the $AEFF_{SECT}$ column shows how the gain in
efficiency of the predictors, based on borrowing strength 
over time, is positive in some cases and negative in others. 
Models MM2 and MM4 (see formulas (6) and (8)), where 
effects of interaction between area and time are present, are 
apparently inadequate because the predictors associated 
with both models perform rather poorly. The performance 
of the predictor associated with MM3 (see (7)) is also 
slightly worse than that of the predictor associated with the 
cross-sectional model: this rather surprising result is
probably due to the low number of waves, which does not 
allow for an effective estimation of the correlation 
coefficient between consecutive time effects. 

As we have already said, the estimator associated with 
model MM5 is the one that performs the best: it is 
considerably more efficient than the one associated with 
SMM, with an $AEFF_{SECT}$ of roughly 85% representing a gain
in efficiency of about 15% due to consideration of more 
than one wave. The EBLUP estimator associated with MM1 
also turns out to be more efficient than the one associated 
with SMM, but in this case the gain is one of only 5%.

Table 1 (continued)

Model   AEFF_DIR%   AEFF_SECT%   RESD%
DIR     100.0       -            15.6
COMP    70.1        -            -9.8
GREG    68.2        -            10.0
SMM     47.7        100.0        -8.7
MM1     45.3        95.0         2.4
MM2     54.1        113.4        4.4
MM3     48.3        101.2        4.7
MM4     54.2        113.6        -8.0
MM5     40.4        84.7         -7.7


These results confirm the fact that household level data at 
several consecutive points in time may be employed, via 
certain kinds of longitudinal model, to produce more 
efficient estimates. 

Moving on to the indicator for shrinkage reported in the 
last column of the table, we can see that the direct estimator
overestimates the standard deviation of the population of area
means by 15%. The same effect, albeit somewhat
attenuated, is observed for the GREG estimator, whose
standard deviation is inflated by 10%. By contrast,
the COMP estimator tends to “shrink” the estimates towards 
the centre of the distribution, leading to a reduction in the 
standard deviation of area means of about 10% with respect 
to the population. These results are in line with those 
obtained by other authors comparing the same kinds of 
estimator (Heady et al. 2004; Spjotvoll and Thomsen 1987). 
The results obtained for EBLUP estimators are more 
encouraging, as the calculated percentage difference is 
always less than 10% in absolute terms. Hence, in this
respect all EBLUP estimators seem to be acceptable. 
Moreover, we may expect that the BLUP estimators are 
under-dispersed compared to the corresponding population 
parameters. In this case, the indicator RESD assumes 
positive values for some longitudinal EBLUP estimators 
because it is calculated only on the last wave, while 
longitudinal models aim to predict $m \times T$ parameters.

Table 2 summarizes the results regarding those EBLUP 
estimators associated with random error variance models, as 
described in the last paragraph of section 2. When no 
auxiliary variables are included in the models, the advantage 
of “borrowing strength” over time and area is singled out 
independently of the advantage associated with covariates. 

As expected, the improvements in efficiency measured
by $AEFF_{DIR}$ are smaller than those shown in Table 1,
although the reductions in ARMSE remain significant. The
ranking of the predictors associated with the various
random effects specifications remains the same as the one
presented in Table 1, with the predictor associated with the
MM5* model being the most efficient, as shown by
ARMSE%. The gain in efficiency associated with this latter
estimator compared with the direct estimator is about 43%.

Table 2 Performance indicators - auxiliary information is unavailable

Model   AARB%         ARMSE%
SMM*    (illegible)   0.575
MM1*    2.9           0.556
MM2*    (illegible)   0.639
MM3*    (illegible)   0.574
MM4*    3.5           0.691
MM5*    3.0           0.445


With regard to bias, the EBLUP estimators obtained 
from those models with no covariates tend to be more 
biased than the corresponding ones with covariates. 

The analysis of the $AEFF_{SECT}$ column shows that the
reduction in ARMSE achieved by some of the models
borrowing strength over time is larger than in the case
where covariates are included, reaching 22% in the best
case, that of the MM5* model.

This last result is really encouraging. In fact, within the 
context of Small-Area Estimation, the absence of any 
known totals of covariates in the population can be very 
limiting when trying to obtain reliable estimates. The 
observed ARMSE reduction connected to the consideration
of more waves in a panel survey shows that estimates may be
improved by “borrowing strength” over time when it is not
possible to exploit auxiliary information.

With regard to the results of shrinkage, they may be 
considered acceptable also in this case, and one can see a 
relationship between the results obtained for EBLUP 
estimators derived from analogous models with or without 
covariates. 


4.2 Comparing different estimators of the MSE of 
EBLUP estimators 


In section 2.3 we reviewed three different estimators of 
the MSE associated with EBLUP estimators. In this section 
we are going to compare the performances of these three 
estimators using our simulation exercise. Given that we are 
focusing on MSE estimation rather than a comparison of 
EBLUP estimators derived from different models, we only 
consider the predictor associated with model MM5, which 
emerged as the best performer in the previous section. 

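Table 2 (continued)

Model   AEFF_DIR%   AEFF_SECT%    RESD%
SMM*    72.8        100.0         -7.6
MM1*    70.3        96.6          (illegible)
MM2*    80.8        111.0         -3.0
MM3*    72.0        99.7          8.6
MM4*    87.2        119.8         -6.7
MM5*    56.2        (illegible)   -6.3

Let us denote the predictor of $\bar{Y}_{dT}$ with $\hat{\eta}_{dT}^{EBLUP}$ and its
mean square error as $MSE(\hat{\eta}_{dT}^{EBLUP})$. The following estimator:

$$MSE_{act}(\hat{\eta}_{dT}^{EBLUP}) = R^{-1} \sum_{r=1}^{R} (\hat{\eta}_{dT}^{EBLUP}(r) - \bar{\eta}_{dT}^{EBLUP})^2 + (\bar{\eta}_{dT}^{EBLUP} - \bar{Y}_{dT})^2,$$

where $\hat{\eta}_{dT}^{EBLUP}(r)$ is $\hat{\eta}_{dT}^{EBLUP}$ calculated on the $r$-th replicated
sample and $\bar{\eta}_{dT}^{EBLUP} = R^{-1} \sum_{r=1}^{R} \hat{\eta}_{dT}^{EBLUP}(r)$, will be used as
benchmark for the comparison of the performance of the
mean square error estimators described in section 2.3,
because the true mean squared error is not known.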
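
The benchmark can be sketched in a few lines of Python. The sketch assumes that the benchmark is the Monte Carlo variance of the replicated predictions plus their squared bias with respect to the known population value; the function name and the numbers are hypothetical.

```python
from statistics import fmean


def mse_actual(replicate_estimates, true_value):
    """Monte Carlo MSE: variance across replicates plus squared bias."""
    mean_estimate = fmean(replicate_estimates)
    variance = fmean([(e - mean_estimate) ** 2 for e in replicate_estimates])
    squared_bias = (mean_estimate - true_value) ** 2
    return variance + squared_bias


# Hypothetical predictions for one area over four replicated samples.
estimates = [9.0, 11.0, 10.5, 9.5]
benchmark = mse_actual(estimates, true_value=10.5)

# The same quantity computed directly as a mean squared deviation from the true value.
direct = fmean([(e - 10.5) ** 2 for e in estimates])
```

The two computations coincide, since the mean squared deviation from the true value decomposes exactly into variance plus squared bias.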

As in the case of point estimators, all computations are 
done using SAS. To determine the Prasad-Rao estimator 
(14), the output of Proc MIXED’s ESTIMATE statement is 
used with the option KENWARDROGER activated. The 
sum $g_1(\hat{\psi}) + g_2(\hat{\psi})$ is obtained from the output of Proc
MIXED. The KENWARDROGER option allows for the
calculation of an MSE inflation factor, described in
Kenward and Rogers (1986), which is equivalent to
$2g_3(\hat{\psi})$ (see also Rao 2003, section 6.2.7).

The estimator $mse_{BL}(\hat{\eta}_{dT}^{EBLUP})$ is re-sampling based.
Hence the evaluation of its performance with respect to a
Monte Carlo exercise requires the implementation of two
nested simulations: for each $r$ ($r = 1, \ldots, R$), we run the
$R_{BOOT}$ replications needed to approximate expectations
with respect to the bootstrap model. To limit the
computational burden, we set $R_{BOOT} = 150$. Butar and
Lahiri (2003) propose an analytical approximation of
$mse_{BL}$, but only for models that are not as complex as the
one in question.
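
The structure of the two nested simulations can be sketched in Python as follows. This is only an illustration of the loop structure: the toy normal-mean model, the function names and the small sizes are assumptions made for the example, and they stand in for the linear mixed model and the SAS code actually used.

```python
import random
from statistics import fmean, pstdev


def bootstrap_mse_of_mean(sample, n_boot, rng):
    """Toy parametric bootstrap: MSE of the sample mean under a fitted normal model."""
    mu, sigma = fmean(sample), pstdev(sample)
    boot_means = [
        fmean([rng.gauss(mu, sigma) for _ in sample]) for _ in range(n_boot)
    ]
    return fmean([(b - mu) ** 2 for b in boot_means])


def nested_simulation(n_outer, n_boot, n_obs, seed=0):
    """Outer Monte Carlo loop with an inner bootstrap loop for every replicate."""
    rng = random.Random(seed)
    estimates, inner_fits = [], 0
    for _ in range(n_outer):  # r = 1, ..., R
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        estimates.append(bootstrap_mse_of_mean(sample, n_boot, rng))
        inner_fits += n_boot  # R_BOOT re-estimations per replicate
    return estimates, inner_fits


# Small sizes for a quick run; R = 500 and R_BOOT = 150 would give
# 500 * 150 = 75,000 inner re-estimations.
estimates, inner_fits = nested_simulation(n_outer=5, n_boot=20, n_obs=30)
```

The count of inner re-estimations grows as the product of the two replication numbers, which is why the bootstrap replications are kept comparatively low.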

For both $mse_{BL}(\hat{\eta}_{dT}^{EBLUP})$ and $mse_{JLW}(\hat{\eta}_{dT}^{EBLUP})$, we have
prepared ad hoc SAS code using the output of Proc
MIXED as input.

In order to compare the three MSE estimators, we 
employ the same measures used to evaluate the performance 
of point estimators, AARB and ARMSE. As there is usually
some concern that MSE estimators may underestimate the
true MSE, we are also interested in the sign of any bias
associated with the estimators in question. Therefore, in the
case of MSE estimators we calculate not only the
average of the absolute values of the estimates obtained for
the bias in each region (AARB), but also the average of
these estimates without the absolute value (AARB'), so as
to better understand whether the given estimators indeed
tend to under-evaluate the MSE or not. Hence the calculated


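measures are:

$$AARB = m^{-1} \sum_{d=1}^{m} \left| R^{-1} \sum_{r=1}^{R} \left( \frac{mse_{*}(\hat{\eta}_{dT}^{EBLUP}(r))}{MSE_{act}(\hat{\eta}_{dT}^{EBLUP})} - 1 \right) \right|,$$

$$AARB' = m^{-1} \sum_{d=1}^{m} \left\{ R^{-1} \sum_{r=1}^{R} \left( \frac{mse_{*}(\hat{\eta}_{dT}^{EBLUP}(r))}{MSE_{act}(\hat{\eta}_{dT}^{EBLUP})} - 1 \right) \right\},$$

$$ARMSE = m^{-1} \sum_{d=1}^{m} \left\{ R^{-1} \sum_{r=1}^{R} \left( \frac{mse_{*}(\hat{\eta}_{dT}^{EBLUP}(r))}{MSE_{act}(\hat{\eta}_{dT}^{EBLUP})} - 1 \right)^2 \right\},$$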


where the symbol * refers to the estimation procedure
considered, that is, PR, BL or JLW. Results of the
comparisons based on R = 500 MC iterations are reported
in Table 3.
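
A minimal Python sketch of the three measures follows. It assumes that each measure averages, over the areas, a summary of the relative errors of the MSE estimates with respect to the benchmark, and that ARMSE is a mean squared relative error with no square root taken (an assumption suggested by the reported ARMSE values lying below the AARB values in Table 3). The function name and the inputs are hypothetical.

```python
from statistics import fmean


def mse_estimator_measures(mse_estimates, mse_benchmark):
    """mse_estimates[d][r]: MSE estimate for area d in replicate r.
    mse_benchmark[d]: benchmark (actual) MSE for area d."""
    rel_bias, rel_sq_err = [], []
    for estimates_d, actual_d in zip(mse_estimates, mse_benchmark):
        rel_errors = [e / actual_d - 1.0 for e in estimates_d]
        rel_bias.append(fmean(rel_errors))
        rel_sq_err.append(fmean([x * x for x in rel_errors]))
    return {
        "AARB": fmean([abs(b) for b in rel_bias]),  # size of the bias
        "AARB_signed": fmean(rel_bias),             # sign of the bias (AARB')
        "ARMSE": fmean(rel_sq_err),                 # mean squared relative error
    }


# Hypothetical example: two areas, two replicates each.
mse_estimates = [[0.8, 1.0], [2.6, 2.2]]
mse_benchmark = [1.0, 2.0]
measures = mse_estimator_measures(mse_estimates, mse_benchmark)
```

In this toy example the first area is underestimated and the second overestimated, so the signed average is smaller than the average of absolute values.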


Table 3 Performance of MSE estimators of $\hat{\eta}_{dT}^{EBLUP}$ under model MM5

Estimator   AARB    AARB'    ARMSE
mse_PR      0.378   -0.383   0.238
mse_BL      0.377   -0.318   0.228
mse_JLW     0.337   0.036    0.261

In terms of ARMSE and AARB, the three estimators
behave similarly, with no particular one emerging as clearly
better than the other two. Nonetheless, the AARB' column
clearly shows that $mse_{PR}$ and $mse_{BL}$ systematically
underestimate $MSE_{act}$, whereas $mse_{JLW}$ does not. This is
probably due to the failure of the normality assumption for
error terms. In fact, as we foresaw in section 3, equivalized
income is a positively skewed variable, and the regression
residuals $e$ also appear to be so. Normality is a crucial
assumption in the derivation of $mse_{PR}$ and $mse_{BL}$, while
$mse_{JLW}$ could be expected to be more robust in this respect.
Our findings are consistent with the theoretical predictions and
simulation results described in Jiang et al. (2002). Although
Bell (2001) noted that $mse_{JLW}$ may be negative for some
data sets because of the bias correction, this never happens in
our simulations. For all replicated data sets, the
second term in (17) gives a positive, and in most cases
substantial, contribution to the estimate of the MSE. A
discussion of modifications of (17) when it returns negative
values can be found in Jiang and Lahiri (2006b).

To conclude then, in the case of the present problem,
$mse_{JLW}$ emerges as the most appropriate of the three
measures for estimating $MSE(\hat{\eta}_{dT}^{EBLUP})$. This finding could
be of importance for any application of normality-based
linear mixed model theory to data sets in which normality
assumptions for error terms do not hold exactly.

We also replicated the simulation exercise for the cross-
section model without covariates SMM*, which is often
considered in simulations aimed at the comparison of
different estimation methods. To this end we note that for
this model the ratio $\hat{\sigma}_e^2 / \hat{\sigma}_v^2$ is around 12, leading to an
EBLUP predictor characterized by $\gamma_d = \hat{\sigma}_v^2 n_d (\hat{\sigma}_v^2 n_d + \hat{\sigma}_e^2)^{-1}$
ranging from 0.54 to 0.9. We note also that some, but not
all, areas are characterized by the presence of outliers
(the skewness coefficient $\gamma_1$ ranges from 0.1 to 4.6).
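
To see how a variance ratio of about 12 translates into the quoted range of weights, the following Python sketch evaluates the usual shrinkage weight of a unit level random intercept model, written in terms of the ratio of the residual variance to the area effect variance. The area sample sizes 14 and 108 are hypothetical values chosen only to reproduce the two ends of the range; they are not taken from the study.

```python
def shrinkage_weight(n_d: int, variance_ratio: float) -> float:
    """gamma_d = n_d / (n_d + ratio), with ratio = sigma_e^2 / sigma_v^2."""
    if n_d <= 0 or variance_ratio < 0:
        raise ValueError("n_d must be positive and the ratio non-negative")
    return n_d / (n_d + variance_ratio)


# Hypothetical area sample sizes, variance ratio of about 12 as in the text.
gamma_small_area = shrinkage_weight(14, 12.0)    # close to 0.54
gamma_large_area = shrinkage_weight(108, 12.0)   # close to 0.90
```

The weight increases with the area sample size, so the predictor leans more on the area's own data in larger areas.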

In this setting MSE estimators show a behavior quite
different from that illustrated in the case of model MM5.
Results are shown in Table 4.


Table 4 Performance of MSE estimators of $\hat{\eta}_{dT}^{EBLUP}$ under model SMM*

Estimator   AARB    AARB'   ARMSE
mse_PR      0.449   0.262   0.503
mse_BL      0.376   0.213   0.376
mse_JLW     0.354   0.149   0.335


All estimators overestimate the actual MSE, although
$mse_{JLW}$ overestimates less than the other two. From a
detailed analysis of results related to individual areas, we
find that the value of AARB' (which represents the most
apparent difference from the results of Table 3) is driven by
severe overestimation of the actual MSE in areas characterized
by the lowest levels of skewness and kurtosis. For these
areas $\hat{\sigma}_e^2$ largely overstates actual variation in the data, thus
leading to overestimation of $g_1(\hat{\sigma}_v^2, \hat{\sigma}_e^2)$. This is likely to be
due to the fact that the failure of normality (the excess of
kurtosis) causes the overestimation of $\sigma_e^2$. This problem did
not appear in the case of model MM5 because of the
presence of covariates and the AR(1) modeling of individual
residuals.


5. Concluding remarks and further developments 


The results obtained show that, in general, EBLUP 
estimators derived from unit level linear mixed model 
specifications that “borrow strength over time”, as well as 
over areas, provide a significant gain in efficiency compared 
with both the direct estimator and with other commonly used
design-based estimators such as the optimal composite
estimator and the GREG estimator. Moreover, the mean
squared error of some of the longitudinal EBLUP estimators 
in question is considerably lower, on average over the areas, 
than that of the analogous cross-sectional EBLUP estimators.
Among the model specifications used to derive
EBLUP estimators, those with independent time and area 
effects, whether inclusive of the autocorrelation of residuals 
or not, appear the most efficient, offering a gain in 
efficiency of about 55-60% compared with the direct 
estimator. These results also hold when covariates are 
removed; in fact, they offer the chance to obtain reliable 
small area estimates even in the absence of covariates, 
provided that repeated observations of the same unit at 
several points in time are available. Besides, the shrinkage
effect connected to EBLUP estimators appears moderate,
reducing the need for ensemble or multiple estimation (Rao
2003, Chapter 9). With regard to estimation of the MSE of
the small area estimators in question, we noted that the
jackknife estimator provides the best results, being correct,
on average, over the areas and thus more robust to any
departure from the standard assumptions of linear mixed
models. This finding may be of importance to all
applications of normality-based linear mixed model theory
to data sets in which normality assumptions do not hold
exactly, as in the case of income data.


Estimation of the coverage of the 2000 census of population in
Switzerland: Methods and results 


Anne Renaud ¹


Abstract 


Coverage deficiencies are estimated and analysed for the 2000 population census in Switzerland. For the undercoverage 
component, the estimation is based on a sample independent of the census and a match with the census. For the 
overcoverage component, the estimation is based on a sample drawn from the census list and a match with the rest of the 
census. The over- and undercoverage components are then combined to obtain an estimate of the resulting net coverage. 
This estimate is based on a capture-recapture model, named the dual system, combined with a synthetic model. The 
estimators are calculated for the full population and different subgroups, with a variance estimated by a stratified jackknife. 
The coverage analyses are supplemented by a study of matches between the independent sample and the census in order to 
determine potential errors of measurement and location in the census data. 


Key Words: Census; Coverage errors; Dual system; Multi-stage sampling plan; Measurement errors. 


1. Introduction 


In any census, some people are not enumerated and 
should be, while others are counted twice or should not have 
been enumerated. There is both undercoverage and over- 
coverage, and quite often, the combined result is net under- 
coverage. For example, net undercoverage is estimated at 
1.6% in the United States in 1990 (Hogan 1993), 2.2% in 
the United Kingdom in 1991 (Brown, Diamond, Chambers, 
Buckner and Teague 1999) and 3% in Canada in 2001 
(Statistics Canada 2004). By contrast, in the United States in 
2000, there is estimated to be net overcoverage of 0.5% 
(Hogan 2003). Coverage deficiencies may vary greatly 
between subgroups of the population. In the United States in 
2000, blacks were found to have a net undercoverage of 
1.8%, while whites had an overcoverage of 1.1%. Also, 
values often vary between age classes and regions, for 
example. These coverage deficiencies, and other errors such 
as measurement errors, result in a biased picture of the 
population. They are therefore studied in order to obtain 
information on the quality of the available data and to find 
ways to improve censuses of the population. 

The 2000 population census in Switzerland gives a 
picture of the population on December 5, 2000. In this 
article, coverage deficiencies in a Swiss census are esti- 
mated for the first time. Undercoverage, overcoverage and 
net coverage resulting from the 2000 census are all 
analysed. Undercoverage is estimated from a sample of 
individuals $S_P$, independent of the census, on which a
coverage survey was organized a few months after the
census (collection took place in April and May 2001). The
data from the survey are matched with data from the census
to determine whether persons in $S_P$ were enumerated.
Overcoverage is estimated from a sample of individuals $S_E$,
drawn from census records. A search for duplicates and 
other erroneous records then serves to determine whether a 
given record corresponds to a real person to be enumerated. 
Net coverage is estimated on the basis of a capture-recapture 
model known as the dual system (Wolter 1986, Fienberg 
1992). The dual estimator is applied in homogeneous cells, 
and the results are recombined using a synthetic model to 
obtain results for different domains of the population 
(Hogan 2003). The purpose of the project is not to adjust the 
census figures but rather to obtain information on the quality 
of the 2000 census and potential improvements for future 
censuses. 

This article describes the different steps followed in 
obtaining estimates, then presents the results. Sections 2 and 
3 describe the data sets and the coverage estimators. Section 
4 provides the details on constructing the different statuses 
used in the estimators. Section 5 describes the approach 
used to compare the values collected in the census and in the 
survey for the matched persons from $S_P$. Sections 6 and 7
present the numerical results and the conclusion. 


2. The three data sets 


2.1 Census 


The 2000 census was conducted under the auspices of 
the Federal Statistical Office, with the reference date of 
December 5, 2000. Information was collected for 7.3 
million inhabitants, 3.1 million households, 3.8 million 
dwellings and 1.5 million buildings. The different levels 
were then linked by common identifiers when the data were 
processed. 


1. Anne Renaud, Service de méthodes statistiques, Office fédéral de la statistique, Espace de l'Europe 10, CH-2010 Neuchâtel, Switzerland. E-mail: Anne.Renaud@bfs.admin.ch.

The collection of information on persons and households
was the responsibility of Switzerland’s 2,896 political 
communes. The latter had a choice between different 
methods of collection: 

— TRADITIONAL: use of census agents; 

— SEMI-TRADITIONAL: pre-printed questionnaires 
based on the communal register of inhabitants are 
mailed out and then collected by census agents; 

— TRANSIT: pre-printed questionnaires are mailed out 
and mailed back; 

— FUTURE: identical to TRANSIT except that links 
between households and dwellings are supplied by 
the commune; 

— TICINO: similar to TRANSIT but limited to the 
canton of Tessin. 


Most of the SEMI-TRADITIONAL, TRANSIT, 
FUTURE and TICINO communes also offered the option of 
completing questionnaires online. The 2,208 SEMI- 
TRADITIONAL, TRANSIT, FUTURE and TICINO 
communes that used the pre-printing of questionnaires 
based on communal registers of inhabitants account for 
nearly 96% of the population. For most of these communes, 
the tasks of mailing out questionnaires and controlling their 
return were organized at a national centre. 

The data set for individuals contains 7,452,075 entries. 
One feature of this data set is that it contains two records for 
the same person if that person has two residences (2.3% of 
the population; for example, a student who both resides with 
his parents and has a residence close to his school). In the 
case of two residences, one is coded as the economic 
residence and the other as the civil residence. The economic 
residence is the place where the person spends the most time 
per week and the civil residence is where the person’s 
official papers are kept (birth certificate for Swiss citizens, 
residence permit for foreigners). Where there is just one 
residence, it is both the economic and the civil residence. 
Switzerland is considered to have a resident population of
7,280,010, based on the set of records showing the
economic residence.

Households are classified as private, collective or 
administrative. Examples of private households are families, 
couples and persons living alone. Examples of collective 
households are groups of occupants of a home for the aged 
or a boarding school or the inmates of a prison. 
Administrative households group together people with no 
fixed residence, travellers and persons - by building or 
commune - who could not be assigned to private or 
collective households (2.4% of the resident population). 

Census data contain no imputation at the record level, 
since communes sent basic information for non-respondents 
(unit non-response). However, values are imputed in the
case of missing data or inconsistency in questionnaires (item
non-response). 

The population of interest for coverage estimates is the 
resident population (based on economic residence) in 
private and administrative households. Collective house- 
holds, which account for 2.3% of the enumerated resident 
population, are excluded from the estimates. 


2.2 $S_P$ sample, coverage survey and matching
(undercoverage)


The size objective for the $S_P$ sample is set at
approximately 50,000 people. In the absence of existing 
frames in Switzerland, this value was determined approxi- 
mately, based on experiences in other countries. In 
particular, the Australian results for 1996 were used, since 
the sampling plan for Australia’s coverage survey was 
similar to the one for Switzerland in 2000 (ABS 1997). 

The $S_P$ sample, which is independent of the census, is
constructed in two parts: the canton of Tessin (TICINO) and 
the rest of Switzerland (NORD). Both parts use a multi- 
stage draw. The first stage consists in selecting 303 primary 
units - these are political communes for TICINO and postal 
codes for NORD - according to a stratified plan with a draw 
proportional to the number of buildings. The second stage 
consists in a simple random draw of a fixed number of 60 
buildings per primary unit. In the NORD plan, these 
buildings are allocated to a maximum of three mail delivery 
routes, based on an intermediate sampling stage. The 
sampling is thus constructed so as to consolidate the field 
work while limiting the variability of the weights. For 
practical reasons and in light of available resources, postal 
codes that include a large proportion of buildings lacking 
complete postal addresses or coded as unoccupied are 
selected with a lower probability than other postal codes. 
These tend mainly to be postal codes in rural areas or 
industrial zones, which are unlikely to exhibit major 
coverage deficiencies. With the assistance of postal 
employees, complete lists of households are drawn up in the 
field within the sample of approximately 16,000 buildings. 
A sub-sample of buildings is then drawn so as to obtain a 
total of approximately 27,000 households. For more 
information on the sampling and survey procedure, see 
Renaud (2001) and, in greater detail, Renaud and 
Eichenberger (2002). 

The coverage survey consists in contacting the 27,000 
households - by telephone if a telephone number is found 
and in person if not. The variables collected are those that 
lend themselves to matching with the census and defining 
subgroups of interest for the coverage study (socio- 
demographic variables, addresses). The collection operation 
covers all members of all households in the selected 
buildings. 


Survey Methodology, December 2007 


The final sample $S_P$ contains $n_P = 49,883$ people in the
population of interest (persons listed at their economic 
residence and residing in a private household). Of the 
households contacted, 88% were reached by telephone and 
12% in person. The weighting depended on the sampling 
and an adjustment for non-response. The adjustment for 
non-response was based on a homogeneity model in cells 
constructed on the basis of the sampling strata and whether 
or not a telephone number was known to exist (interviews 
conducted by telephone or in person). It also incorporated 
an estimate of the proportion of true households among the 
households to be contacted, since a sizable portion of the 
households to be contacted actually consisted of vacant 
dwellings, stores or businesses. No calibration was applied, 
since the auxiliary data available were not independent of 
the census. There was no partial non-response. The 
weighting details are documented in Renaud and Potterat 
(2004). 
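
A homogeneity model in cells is commonly implemented as a weighting-class adjustment, and the Python sketch below illustrates that general technique. It is an assumption that the survey used this exact form: the additional correction for the proportion of true households is not shown, and the field names and values are hypothetical.

```python
from collections import defaultdict


def adjust_for_nonresponse(units):
    """Inflate respondent weights so that each cell keeps its total sampling weight.

    units: list of dicts with keys 'cell', 'weight' and 'responded'.
    Returns the respondents only, with adjusted weights.
    """
    total = defaultdict(float)
    responding = defaultdict(float)
    for u in units:
        total[u["cell"]] += u["weight"]
        if u["responded"]:
            responding[u["cell"]] += u["weight"]
    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total[u["cell"]] / responding[u["cell"]]
            adjusted.append({**u, "weight": u["weight"] * factor})
    return adjusted


# Hypothetical cells, e.g. stratum crossed with telephone number known or not.
units = [
    {"cell": "A", "weight": 10.0, "responded": True},
    {"cell": "A", "weight": 10.0, "responded": False},
    {"cell": "A", "weight": 20.0, "responded": True},
    {"cell": "B", "weight": 5.0, "responded": True},
]
adjusted = adjust_for_nonresponse(units)
adjusted_total = sum(u["weight"] for u in adjusted)
```

Within each cell the respondents are assumed to represent the non-respondents, so the adjusted weights sum to the original total of the sampling weights.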

Based on the questions asked in the survey and various 
plausibility controls, we hypothesize that the $S_P$ data are
correct and usable for matching with the census. The quality 
criteria used are as follows: 


— completeness: the record is sufficient to identify 
the person; 

— appropriateness: the person should have been 
enumerated; 

— uniqueness: the person is listed only once; 

— belonging to population of interest: the person is 
listed at his/her economic residence and in a 
private household; 

— correctness of location: the person is listed at the 
correct address on Census Day. 


The matching between the $S_P$ sample and the census
serves to determine the matching status $P_j$ of each element
$j$ of $S_P$. Status $P_j$ is equal to 1 if the element is matched
in the census (enumerated person) and 0 if this is not the
case (person not enumerated). In our case, the data collected 
in the coverage survey, the final census data and images of 
the census questionnaires are used for automatic matching, 
manual matching and controls. No supplementary interview 
took place in addition to the coverage survey. Persons who 
moved between Census Day and the day of the survey were 
sampled at their address on the day of the survey and then 
searched for on a priority basis at the address they had 
indicated for the day of the census. No case was unresolved 
by the end of the process. 


2.3 $S_E$ sample and search for erroneous records
(overcoverage) 


The size objective for the $S_E$ sample was set at approximately
55,000 persons. This value, somewhat greater than
that of $S_P$, had little influence on the processing of the data,
since there was no field work or interview supplementary to 
the census. 

The $S_E$ sample was selected from the census data using
a two-stage draw. Only elements included in the population
of interest were eligible (records at the economic residence,
without members of collective households). The primary
units of $S_E$ were identical to the primary units of $S_P$
(postal codes and communes). However, the list of postal
codes in the NORD plan that was used for $S_P$ did not
correspond exactly to the list of postal codes that were
present in the census data. Census records found in postal
codes that did not exist in the list used for $S_P$ were
therefore reallocated to existing codes, taking geographic
location into account (this involved assigning fictitious
postal codes for the sampling). In the second stage, records
were drawn from the population of interest using a simple
random plan, without intermediate stages. The allocation
was done in such a way as to obtain constant weights in the
sampling strata of the primary units. In the end, the sample
contained $n_E = 55,375$ records (Renaud 2003).

We hypothesize that $S_E$ records are sufficient to identify
persons (completeness), since there is little imputation in the
census data and most questionnaires were pre-printed based
on registers of inhabitants. Appropriateness and uniqueness
were determined in a matching between $S_E$ and the rest of
the census using a procedure similar to the matching
between $S_P$ and the census. In our case, this involves a
search for duplicates or triplicates of elements of $S_E$,
supplemented by an analysis of suspect cases in $S_E$. An
element $j$ is considered appropriate if it is not considered
erroneous in the analysis of suspect cases (e.g., a note on the
questionnaire indicating that the person has gone abroad).
An element $j$ is considered unique if no duplicate or
triplicate is detected in the census. There is no
supplementary interview for $S_E$. There is therefore no
information supplementary to the census for $S_E$ persons
(actual location? actual type of residence or household?).
The search for duplicates/triplicates and suspect cases
results in an enumeration status $E_j$ for each element $j$ of
$S_E$. Status $E_j$ is equal to 1 if the element should indeed
have been enumerated in the census (default value) and 0 if
it should not have been enumerated. In practice, it can take
on values between 0 and 1 if the case is not determined
precisely. Thus, duplicates and triplicates receive
respectively the values 1/2 and 1/3 if there is no information
allowing the correct record to be determined from among
the records detected. These cases, which are rare, consist of
persons who completed more than one questionnaire in the
census without any link having been made between those
questionnaires during the processing of the data.


3. Coverage estimators


3.1 Undercoverage and overcoverage 


The undercoverage rate is estimated by $\hat{R}_{under} = 1 - \hat{R}_m$,
where $\hat{R}_m$ is the estimate of the correct matches rate based
on the $S_P$ sample. Similarly, the overcoverage rate is
defined as $\hat{R}_{over} = 1 - \hat{R}_c$, where $\hat{R}_c$ is the estimate of the
correct records rate based on the $S_E$ sample. The correct
matches rate and the correct records rate are estimated by
the weighted means of matching status $P_j$ and enumeration
status $E_j$ as follows:

$$\hat{R}_m = \frac{\sum_{j \in S_P} w_{P,j} P_j}{\sum_{j \in S_P} w_{P,j}} \quad \text{and} \quad \hat{R}_c = \frac{\sum_{j \in S_E} w_{E,j} E_j}{\sum_{j \in S_E} w_{E,j}},$$

where $w_{P,j}$ is the weight of element $j$ of sample $S_P$ and
$w_{E,j}$ is the weight of element $j$ of sample $S_E$. We note
that the denominator of $\hat{R}_c$ is the sum of the weights $w_{E,j}$
of $S_E$, and not the number $C$ of known records in the
census, so as to have a potentially less biased estimator.

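The estimate of the undercoverage and overcoverage
rates in a domain $d$ is given by $\hat{R}_{under,d} = 1 - \hat{R}_{m,d}$ and
$\hat{R}_{over,d} = 1 - \hat{R}_{c,d}$, with


$$\hat{R}_{m,d} = \frac{\sum_{j \in S_P} w_{P,j} P_j I_{j,d}}{\sum_{j \in S_P} w_{P,j} I_{j,d}} \quad \text{and} \quad \hat{R}_{c,d} = \frac{\sum_{j \in S_E} w_{E,j} E_j J_{j,d}}{\sum_{j \in S_E} w_{E,j} J_{j,d}}.$$


Indicators $I_{j,d}$ and $J_{j,d}$ take on the value 1 if element $j$,
respectively of $S_P$ and $S_E$, is found in domain $d$;
otherwise their value is 0.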
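
As a minimal sketch of these weighted means, the following Python fragment computes an overall and a domain-level rate. The `Element` record, the function name and the toy weights are assumptions made for illustration; only the weighted-mean form comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Element:
    weight: float   # survey weight of the element
    status: float   # matching status P_j or enumeration status E_j, in [0, 1]
    domain: str     # hypothetical domain label, e.g. an age class


def weighted_rate(elements, domain=None):
    """Weighted mean of statuses, optionally restricted to one domain."""
    selected = [e for e in elements if domain is None or e.domain == domain]
    total_weight = sum(e.weight for e in selected)
    if total_weight == 0:
        raise ValueError("no weighted elements in the selection")
    return sum(e.weight * e.status for e in selected) / total_weight


# Toy coverage-survey sample: status 1 means a correct match with the census.
s_p = [
    Element(100.0, 1, "20-31"),
    Element(150.0, 0, "20-31"),
    Element(250.0, 1, "32-44"),
    Element(500.0, 1, "32-44"),
]

r_m = weighted_rate(s_p)                           # correct matches rate
r_under = 1 - r_m                                  # undercoverage rate
r_under_young = 1 - weighted_rate(s_p, "20-31")    # domain undercoverage rate
```

The same function applies to the census sample with enumeration statuses, giving the correct records rate and, as its complement, the overcoverage rate.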


3.2 Net coverage 


The net undercoverage rate is estimated by $\hat{R}_{netunder} =
1 - \hat{R}_{net}$, where $\hat{R}_{net} = C / \hat{N}$ is the estimate of the net
coverage rate, $C$ is the number enumerated in the
population of interest and $\hat{N}$ is the estimate of the true total
in the population of interest. If $\hat{R}_{netunder}$ is negative, there is
net overcoverage.

The estimate of the true total $\hat{N}$ is based on the dual
model (Wolter 1986). This model is built on the principle of
capture (census) and recapture (coverage survey). It is
applied in estimation cells $k = 1, ..., K$ in order to best
satisfy the assumptions of the model; see discussion below.
Thus, the estimate of the true total $\hat{N}$ is composed of the
sum of the estimated true totals $\hat{N}_k$ in disjoint estimation
cells $k = 1, ..., K$ covering the population of interest:

$$\hat{N} = \sum_{k=1}^{K} \hat{N}_k. \quad (3)$$


The estimated totals $\hat{N}_k$ have the form given by the dual
model:

$$\hat{N}_k = \frac{\hat{N}_{1+,k} \, \hat{N}_{+1,k}}{\hat{N}_{11,k}}, \quad (4)$$

where $\hat{N}_{1+,k}$ is the total of records correctly counted in cell
$k$ during capture (census), $\hat{N}_{+1,k}$ is the total in $k$ during
recapture (estimated from sample $S_P$) and $\hat{N}_{11,k}$ is the
number of records common to the two lists (estimated from
matches between $S_P$ and the census).

The different terms of equation (4) are estimated using 
undercoverage and overcoverage estimates. This is an 
extension of the model in Wolter (1986), similar to the one 
used by Hogan (2003). Thus, the total of the records 
correctly counted in the census $\hat{N}_{1+,k}$ is estimated by the
enumerated total $C_k$ multiplied by the correct records rate
$\hat{R}_{c,k}$, to take account of overcoverage. Also, the ratio
between the total in the recapture $\hat{N}_{+1,k}$ and the number of
records common to the two lists $\hat{N}_{11,k}$ is estimated by the
inverse of the rate of matching $\hat{R}_{m,k}$ between the coverage
survey and the census in order to take account of
undercoverage. We obtain

$$\hat{N}_k = [C_k \hat{R}_{c,k}] \, [1 / \hat{R}_{m,k}] = C_k \hat{R}_{c,k} / \hat{R}_{m,k} = C_k \hat{F}_k, \quad (5)$$

where $\hat{F}_k = \hat{R}_{c,k} / \hat{R}_{m,k}$ is the coverage correction factor in
cell $k$. Factor $\hat{F}_k$ combines the effects of overcoverage and
undercoverage of cell $k$ estimated from samples $S_P$ and
$S_E$. We note that undercoverage in one domain may be
offset by overcoverage in the same domain. Thus, nil net 
undercoverage in a domain does not mean that no coverage 
deficiency exists in it. 
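
The cell-wise correction can be sketched in Python as follows. The cell counts and rates are invented for illustration and the function names are hypothetical; the sketch only applies the correction factor, defined as the correct records rate divided by the correct matches rate, to each enumerated cell total and sums the results.

```python
def correction_factor(r_c, r_m):
    """Coverage correction factor: correct records rate over correct matches rate."""
    if not 0 < r_m <= 1:
        raise ValueError("the correct matches rate must lie in (0, 1]")
    return r_c / r_m


def dual_system_total(cells):
    """Sum of corrected cell totals; cells holds (C_k, R_c_k, R_m_k) triples."""
    return sum(c_k * correction_factor(r_c, r_m) for c_k, r_c, r_m in cells)


# Two hypothetical estimation cells.
cells = [(1000, 0.99, 0.90), (2000, 1.00, 0.80)]
c_total = sum(c_k for c_k, _, _ in cells)

n_hat = dual_system_total(cells)      # estimated true total
r_net = c_total / n_hat               # net coverage rate
r_netunder = 1 - r_net                # net undercoverage rate

# Equal rates give a factor of 1: no net undercoverage in the cell,
# although 5% overcoverage and 5% undercoverage are both present.
balanced = correction_factor(0.95, 0.95)
```

The last line illustrates the remark above: a factor of 1 hides offsetting deficiencies and does not mean the cell is free of coverage errors.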

The proposed estimates are based on the assumptions of 
the dual model, the choice of estimation cells, and the 
choice of the statuses defining the estimators $\hat{R}_{m,k}$ and
$\hat{R}_{c,k}$. The dual model is useful since it takes into account
the fact that some persons are reached neither by the census 
(capture) nor by the coverage survey (recapture). However, 
a series of conditions must be met to avoid estimation 
biases. The coverage survey and the census must be totally 
independent. The matching must be of very high quality. 
The model must be applied in cells with persons who have 
the same probability of being enumerated in the census and 
the survey respectively; see Section 3.3. Lastly, the 
population must not change too much between Census Day
and the day of the survey. As to the estimators $\hat{R}_{c,k}$ and
$\hat{R}_{m,k}$, they are based on the quality of the matching and the
search for erroneous records. Also, it is necessary to ensure
that the definition of a correct match in $S_P$ and the
definition of a correct record in $S_E$ are identical, i.e., that
there is a balance between overcoverage and undercoverage; 
see Section 4. All those elements are taken into 
consideration insofar as possible in the present estimates. 

The estimate of net undercoverage in a domain $d$ has the
form $\hat{R}_{netunder,d} = 1 - \hat{R}_{net,d} = (\hat{N}_d - C_d) / \hat{N}_d$, where $C_d$ is
the enumerated number in the domain and $\hat{N}_d$ is the
estimate of the true total. The estimate of the true total $\hat{N}_d$
is based on a synthetic model that assumes that the
correction factor is fixed in each cell $k = 1, ..., K$:

$$\hat{N}_d = \sum_{k=1}^{K} \hat{N}_{k,d} = \sum_{k=1}^{K} C_{k,d} \hat{F}_k. \quad (6)$$

$C_{k,d}$ is the number enumerated in the population of
interest in the intersection between cell $k$ and domain $d$,
and $\hat{F}_k$ is the correction factor for the coverage in cell $k$.
The hypothesis of the synthetic model is satisfied if the 
behaviour of any subset in the cell is identical to that of the 
entire cell. This homogeneity is best controlled by the 
choice of cells. Here we are using the homogeneous cells 
defined by the dual model. 
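
A small Python sketch of the synthetic domain estimate follows. The cell labels, counts and factors are hypothetical; the only ingredient taken from the text is that each cell's correction factor is applied unchanged to the part of the domain falling in that cell.

```python
def synthetic_net_undercoverage(counts_by_cell, factors):
    """Net undercoverage rate in a domain under the synthetic model.

    counts_by_cell: enumerated number of the domain in each cell.
    factors: coverage correction factor of each cell.
    """
    c_d = sum(counts_by_cell.values())
    n_d = sum(count * factors[cell] for cell, count in counts_by_cell.items())
    return (n_d - c_d) / n_d


# Hypothetical correction factors for two cells and a domain spread over both.
factors = {"cell_1": 1.10, "cell_2": 1.25}
domain_counts = {"cell_1": 300, "cell_2": 100}

r_netunder_d = synthetic_net_undercoverage(domain_counts, factors)
```

Here the estimated true total of the domain is 330 + 125 = 455 against 400 enumerated, so the rate is 55/455, about 12.1%.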


3.3 Estimation cells 


The estimation cells $k = 1, ..., K$ are constructed in
such a way as to group together elements that have
homogeneous probabilities of enumeration in the census and
the survey respectively (dual hypothesis) and homogeneous
net coverage rates (synthetic hypothesis). We want a
minimum of 100 persons per cell in $S_P$ and $S_E$ in order to
control the variance and limit the estimation bias. The
variables defining the cells are selected using a logistic
regression model and a discrimination method applied to the
data from $S_P$ (binary variable: $P_j$). The three most
influential variables are cross-tabulated: nationality in two 
categories, marital status in two categories and size of 
commune in three categories. The other variables are then 
successively integrated. Groupings are created when the cell 
sizes are too small (official language of commune in two 
categories, age class in seven categories and sex in two 
categories). In the end, 121 estimation cells are obtained; see 
Renaud (2004) for more details. 
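
The size requirement can be checked mechanically. The sketch below flags candidate cells that fall short of the minimum of 100 persons in either sample; the cell labels and sizes are invented, and the grouping step itself, which depends on the variables involved, is not modelled.

```python
from collections import Counter

MIN_CELL_SIZE = 100  # minimum number of persons per cell stated in the text


def small_cells(cells_p, cells_e, minimum=MIN_CELL_SIZE):
    """Return the labels of cells that are too small in either sample.

    cells_p, cells_e: cell label of each person in the two samples.
    """
    n_p, n_e = Counter(cells_p), Counter(cells_e)
    labels = set(n_p) | set(n_e)
    return sorted(k for k in labels if min(n_p[k], n_e[k]) < minimum)


# Hypothetical cell labels: cell "B" has only 95 persons in the first sample.
cells_p = ["A"] * 120 + ["B"] * 95
cells_e = ["A"] * 130 + ["B"] * 140

to_group = small_cells(cells_p, cells_e)
```

Cells returned by such a check would be candidates for grouping with neighbouring categories.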


3.4 Variance of coverage estimators 


The variance of the estimators is estimated by a stratified 
jackknife applied to the (identical) primary units of $S_P$ and
$S_E$. We note that the variance of the estimated undercoverage
$\hat{R}_{under} = 1 - \hat{R}_m$ is equal to the variance of the
estimated matching rate $\hat{R}_m$. Similarly, the variance of the
overcoverage $\hat{R}_{over} = 1 - \hat{R}_c$ is equal to that of the correct
record rate $\hat{R}_c$, and the variance of the net undercoverage
$\hat{R}_{netunder} = 1 - \hat{R}_{net}$ is equal to that of the net coverage rate
$\hat{R}_{net}$.

Let $\theta$ be the parameter of interest taking the form of a
weighted mean of statuses in the case of undercoverage and
overcoverage, and the form of a linear function of quotients
between two weighted means in the case of net undercoverage.
Its estimator is $\hat{\theta}$.

Let $h = 1, ..., H$ be the stratum used in the first stage of
sampling, $i = 1, ..., m_h$ the number of the primary unit in
stratum $h$ (postal code for NORD or commune for
TICINO), and $j = 1, ..., n_{hi}$ the number of the person in
primary unit $i$ of $h$. For the needs of the jackknife method,
samples $S_P$ and $S_E$ are partitioned, in each stratum $h$, into
$m_h$ subsets corresponding to the persons in primary units
$i = 1, ..., m_h$.

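Let $\hat{\theta}_{(h\alpha)}$ be the estimator having the same form as $\hat{\theta}$ but
calculated on the sample from which primary unit $\alpha$ of
stratum $h$ has been removed. We note that estimators
$\hat{R}_{m(h\alpha),k}$ and $\hat{R}_{c(h\alpha),k}$, $k = 1, ..., K$, are combined to form
$\hat{R}_{net(h\alpha)}$:

$$\hat{R}_{net(h\alpha)} = C \left[ \sum_{k=1}^{K} C_k \frac{\hat{R}_{c(h\alpha),k}}{\hat{R}_{m(h\alpha),k}} \right]^{-1}. \quad (7)$$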


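The corrected weights $w_{hij(\alpha)}$ used to calculate values
$\hat{R}_{m(h\alpha),k}$ and $\hat{R}_{c(h\alpha),k}$ have the following form:

$$w_{hij(\alpha)} = \begin{cases} 0 & \text{if } i = \alpha \\ \frac{m_h}{m_h - 1} \, w_{hij} & \text{if } \alpha \in h \text{ and } i \neq \alpha \\ w_{hij} & \text{if } \alpha \notin h. \end{cases} \quad (8)$$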


This form of correction is preferred to the quotient 
between the sum of the weights of the elements in the 
stratum and the sum of the weights without primary unit $\alpha$
since it allows us to take account of the variability due to the
unknown number of elements in the stratum.

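The jackknife estimator becomes:

$$\hat{\theta}_{JK} = \frac{1}{H} \sum_{h=1}^{H} \frac{1}{m_h} \sum_{\alpha=1}^{m_h} \hat{\theta}_{h\alpha}, \quad (9)$$

with pseudo values $\hat{\theta}_{h\alpha} = m_h \hat{\theta} - (m_h - 1) \hat{\theta}_{(h\alpha)}$. The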
estimator of its variance can take different forms; see the 
example of Shao and Tu (1995). We apply the following 


form:

$$v(\hat{\theta}_{JK}) = \sum_{h=1}^{H} \frac{1}{m_h (m_h - 1)} \sum_{\alpha=1}^{m_h} (\hat{\theta}_{h\alpha} - \hat{\theta}_h)^2, \quad (10)$$

with $\hat{\theta}_h = \sum_{\alpha=1}^{m_h} \hat{\theta}_{h\alpha} / m_h$. Lastly, we use $v(\hat{\theta}_{JK})$ as an
estimator of the variance of $\hat{\theta}$. The estimates in the
subgroups use the same form of estimator with integration
of a domain indicator in the construction of $\hat{\theta}_{(h\alpha)}$. No
correction for the finite population is applied in the
estimates. Also, other variabilities are not taken into
account, such as the variability induced by the weighting
model for non-response in $S_P$.

Problems, such as the lack of stability of estimation in 
strata with few primary units, appeared in the course of 
applying this approach. However, tests on the sharing of 
some primary units and a comparison with the Taylor
linearization or a simple jackknife suggest that the estimators
of variance by stratified jackknife that are presented
in this document are fairly conservative. 
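
The procedure can be sketched in Python for the simplest case, a weighted mean of statuses. Several points are assumptions rather than statements of the source: the rescaling of the remaining weights in a stratum by the number of primary units over that number minus one, the variance taken as the within-stratum spread of the pseudo values divided by m_h(m_h - 1), and all names and data. It is a sketch of a delete-one-primary-unit jackknife, not the production code used for the published estimates.

```python
from collections import defaultdict


def weighted_mean(rows):
    """Weighted mean of statuses for rows given as (weight, status) pairs."""
    rows = list(rows)
    return sum(w * y for w, y in rows) / sum(w for w, _ in rows)


def stratified_jackknife(data):
    """Delete-one-primary-unit jackknife for a weighted mean.

    data: list of (stratum, primary_unit, weight, status) tuples.
    Returns the full-sample estimate and its estimated variance.
    """
    theta = weighted_mean((w, y) for _, _, w, y in data)

    units_by_stratum = defaultdict(set)
    for h, i, _, _ in data:
        units_by_stratum[h].add(i)

    variance = 0.0
    for h, units in units_by_stratum.items():
        m_h = len(units)
        if m_h < 2:
            raise ValueError(f"stratum {h!r} needs at least two primary units")
        pseudo_values = []
        for alpha in units:
            reweighted = []
            for g, i, w, y in data:
                if g != h:
                    reweighted.append((w, y))  # other strata: weight unchanged
                elif i != alpha:
                    # same stratum, unit kept: weight rescaled (assumed factor)
                    reweighted.append((w * m_h / (m_h - 1), y))
                # rows of the removed unit get weight 0 and are dropped
            theta_deleted = weighted_mean(reweighted)
            pseudo_values.append(m_h * theta - (m_h - 1) * theta_deleted)
        mean_h = sum(pseudo_values) / m_h
        variance += sum((p - mean_h) ** 2 for p in pseudo_values) / (m_h * (m_h - 1))
    return theta, variance


# One stratum, four primary units of one person each, equal weights.
sample = [("h1", 1, 10.0, 1), ("h1", 2, 10.0, 0), ("h1", 3, 10.0, 1), ("h1", 4, 10.0, 1)]
theta_hat, v_hat = stratified_jackknife(sample)
```

In this degenerate example the pseudo values equal the individual statuses, so the variance estimate reduces to the usual sample variance of the mean. Strata with a single primary unit are rejected, which echoes the instability noted above for strata with few primary units.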


4. Choice of correct matching 
statuses and enumerations 


A key element of coverage estimates is the definition of 
the correct matching status for the elements of $S_P$ and the
correct enumeration status for the elements of $S_E$. These
correct statuses are defined on the basis of the basic statuses $P_j$ and
$E_j$ determined during the matchings.

Is a match with a census element that is part of a 
collective household accepted as a correct match for an 
element of $S_P$, or is this a case of undercoverage of the
population of interest? Is a duplicate outside the population
of interest for an element of $S_E$ really considered a
duplicate, and hence an instance of overcoverage, or should 
it be excluded? A clear definition is needed. Also, the 
statuses used in estimates of net undercoverage must be 
chosen in such a way as to satisfy the balance between over- 
and undercoverage; see the concept of “balancing,” as, for 
example, in Hogan (2003). A match ($P_j = 1$) with an
element outside the population of interest may, for example, 
be rejected as a correct match (correct match status = 0, no 
undercoverage) only if the search for correct records would 
also detect this element as incorrect because it is out of 
scope (correct enumeration status = 0, no overcoverage). 

The criteria for defining correct statuses are constructed 
using information available for elements of $S_P$ and $S_E$. As
regards $S_P$, we start with the assumption that census
records that were matched with elements of $S_P$ serve to
identify persons (completeness) and these persons should
indeed have been enumerated (appropriateness). We also
consider that they are unique, since uniqueness, while
controlled by matches, is achieved in the great majority of
cases controlled in $S_E$. The criteria of belonging to the
population and correctness of location are controlled by 
comparison with the information collected in the coverage 
survey, considered as reference information. No supplemen- 
tary data collection was organized to resolve ambiguous 
cases. As regards $S_E$, we have the criterion of completeness
considered as having been met in the census data and the
results concerning uniqueness and appropriateness obtained
in the matching with the rest of the census. For duplicates
and triplicates, we define $E_j = 1/d'$, with $d'$ = number of
duplicates/triplicates in the population of interest according
to the census. The criteria of belonging to the population of
interest and correctness of location for the elements of $S_E$
cannot be controlled, since we have no reference data
supplementing the census.

For estimates of net undercoverage, it is important to
meet the balancing requirement. The criteria used in 
defining correct statuses are thus completeness, appropriate- 
ness and uniqueness. The criteria of belonging to the 
population of interest and correctness of location cannot be 
considered, since they are not usable in defining the correct 
enumeration status. The criteria of completeness, appro- 
priateness and uniqueness are already integrated into the 
construction of the basic statuses $P_j$ and $E_j$. Thus, estimates
are made with basic statuses $P_j$ and $E_j$.

For estimates not using the dual system and the need for 
balancing, it is possible to use other criteria to define correct 
statuses. Other types of correct match statuses are used in 
the analysis of potential measurement errors in Section 5 
and the more detailed analyses of matches and enumerations 
presented in Renaud (2004). 


5. Comparison of matches


5.1 Potential measurement errors


Measurement errors or classification errors are related to 
coverage errors. A person who is classified in domain d 
according to the census (e.g., a person between 10 and 19 
years of age) but who in reality is outside the domain (e.g., a 
person 60 years of age) would end up as a case of 
overcoverage in domain d and an undercoverage case 
outside that domain. This misclassification does not cause a 
coverage error at the overall level, but it causes an error at 
the level of subgroups of the population. 

The reasons for differences between the values collected 
in two surveys such as the census and the coverage survey 
may be quite varied and difficult to dissociate. It is to be 
expected that there will be matching errors, differences 
resulting from collection methods (paper questionnaire or 
telephone/face-to-face interview) and data processing 
methods, or real differences due to the time lag between the 
collection periods (December 2000 and April-May 2001). 
Also, it is difficult to determine the correct response if there 
are two different values. What is the correct choice - the 
census? the survey? another value not collected? 

Potential measurement errors with respect to census data 
are analysed on the basis of a set of matches between the 
independent sample S, and the census. We choose to 
determine which variables show respectively few or many 
potential classification problems, without making a 
judgment on the quality of either data collection. One use of 
this information is to evaluate the choice of estimation cells 
for the dual system and select subgroups for which the 
estimates of coverage deficiencies are soundest. 

For the categorical variable $X$, we define the matching
rate in the good domain $\hat{R}_X$ as follows:

$$\hat{R}_X = \frac{\sum_{j \in S_{P,match}} w_{P,j} P_{X,j}}{\sum_{j \in S_{P,match}} w_{P,j}}, \quad (11)$$

where $w_{P,j}$ is the weight of element $j$ of matched sample
$S_P$ ($S_{P,match}$) and the classification status $P_{X,j}$ is equal
to 1 if element $j$ appears in the same class in the census
and the survey, and 0 otherwise. The value of $\hat{R}_X$ is
estimated with the set of matched elements and with the
subgroup of elements without imputation in the census.

We also define a measure of asymmetry $\theta_X(d, d')$ for
classes $d$ and $d'$ of variable $X$:

$$\theta_X(d, d') = \frac{\sum_{j \in S_{P,match}} w_{P,j} I_j(d, d')}{\sum_{j \in S_{P,match}} w_{P,j} I_j(d', d)}, \quad (12)$$

where $I_j(d, d') = 1$ if element $j$ appears in domain $d$
according to the survey and in domain $d'$ according to the
census, and 0 otherwise. The factor $\theta_X(d, d')$ is equal to 1
if there is a balance in the classification errors - in other
words, if the number of elements in $d$ according to the
survey and $d'$ according to the census is equal to the
number in $d'$ according to the survey and in $d$ according to
the census. The further the factor lies from 1, the less
balance there is.
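
Both measures are simple weighted counts over the matched elements, as the following Python sketch shows. The tuple layout, function names and toy data are assumptions for illustration only.

```python
def agreement_rate(matches):
    """Weighted share of matched elements classified alike in both sources.

    matches: list of (weight, class_in_survey, class_in_census) tuples.
    """
    total = sum(w for w, _, _ in matches)
    agree = sum(w for w, survey, census in matches if survey == census)
    return agree / total


def asymmetry(matches, d, d_prime):
    """Weighted count of (survey d, census d') over (survey d', census d)."""
    forward = sum(w for w, s, c in matches if s == d and c == d_prime)
    reverse = sum(w for w, s, c in matches if s == d_prime and c == d)
    if reverse == 0:
        raise ValueError("no element classified the reverse way")
    return forward / reverse


# Hypothetical matched elements for a marital status variable.
matches = [
    (10.0, "single", "single"),
    (10.0, "single", "married"),
    (30.0, "married", "single"),
    (50.0, "married", "married"),
]

r_x = agreement_rate(matches)                    # 60 / 100
theta = asymmetry(matches, "single", "married")  # 10 / 30
```

A value below 1, as here, means fewer elements are in the first class according to the survey and in the second according to the census than the reverse.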


5.2 Potential location errors 


Comparisons between the census and the survey can also 
be used to study people’s geographic location. In the census 
data, we have a unique address if the person has a single 
residence and two addresses - principal and secondary - if 
the person has two residences. In the survey data, we have 
one or two addresses on Census Day, one or two addresses 
on the day of the survey and information on a possible move 
between the two dates. If a person has a single residence and 
has not moved, that person’s principal address on Census
Day and his/her principal address on the day of the survey
are identical. The person does not have secondary addresses. 

Different measures of distance are considered in order to 
determine potential location errors in the census. For 
practical reasons, including the data available, we define 
geographic areas around the person’s principal address
collected in the survey for Census Day (reference address).
The areas are sets of political communes. They are defined
on the basis of postal codes identified in the survey. The
person’s basic area is defined by the set of communes that 
have buildings within the postal code of the person’s 
reference address. The definition of this area uses data from 
the Swiss building register, since the latter has information 
on buildings’ postal address and the commune within which 
they are located. The extended area includes the communes 
within the basic area and the set of communes adjacent to 
them; see Renaud (2004) for examples.

Like classification errors, location errors do not cause
coverage errors at the overall level but they cause errors at
the level of subgroups such as regions or types of
communes. Different rates may be defined. We will retain
the basic location rate and the extended location rate, both
weighted by $w_{P,j}$, the weight of element $j$ of matched
sample $S_P$. The location status takes on the value 1 if the
element lies within the basic area or the extended area, as 
the case may be; otherwise it equals 0. In particular, we will 
study the correctness of location of persons who have 
moved, in order to detect possible problems relating to the 
time lag between Census Day and the actual day of 
collection of census data. 
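
As an illustration of the two location rates, the sketch below builds the extended area from a basic area and an adjacency table, then computes the weighted share of elements whose census commune falls inside the chosen area. The commune labels, the adjacency table and the record layout are hypothetical; in the study the basic areas are derived from postal codes and the building register.

```python
def extended_area(basic_area, neighbours):
    """Basic area plus every commune adjacent to one of its communes."""
    area = set(basic_area)
    for commune in basic_area:
        area |= neighbours.get(commune, set())
    return area


def location_rate(records, neighbours, extended=False):
    """Weighted share of elements whose census commune lies in the area.

    records: list of (weight, census_commune, basic_area) tuples, where
        basic_area is the set of communes around the reference address.
    """
    total = located = 0.0
    for weight, census_commune, basic_area in records:
        area = extended_area(basic_area, neighbours) if extended else set(basic_area)
        total += weight
        located += weight * (census_commune in area)  # location status 0 or 1
    return located / total


# Hypothetical communes A, B, C lying in a row.
neighbours = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
records = [(1.0, "A", {"A"}), (1.0, "B", {"A"}), (2.0, "C", {"A"})]

basic_rate = location_rate(records, neighbours)                      # 1 / 4
extended_rate = location_rate(records, neighbours, extended=True)    # 2 / 4
```

By construction the extended rate can never be lower than the basic rate.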


6. Results


6.1 Estimates of coverage deficiencies


The overall net undercoverage rate is estimated at 1.41% 
with a standard deviation of 0.12%. The overcoverage rate 
is 0.35% (standard deviation=0.03%) and the under- 
coverage rate is 1.64% (standard deviation = 0.11%). These 
results are of the same order of magnitude as those of other 
countries, although they are in the lower range; see Table 1. 

Overcoverage is minor in the great majority of the 
domains studied. The highest rate is observed for persons 
between 20 and 31 years of age (0.93% with a standard 
deviation of 0.09%); see Table 2. However, undercoverage 
is high in several domains. For example, a rate of 8.03% 
(standard deviation = 0.85%) is observed for foreigners with 
temporary settlement permits (“other permits”) and a rate of
3.50% (standard deviation = 0.50%) is observed for 20-31- 
year-olds. Also, an undercoverage rate of 2.4% is observed 
in the Italian-speaking region of the country (language of 
commune: Italian; NUTS region: Ticino, and collection 
method: TICINO). However, these results are subject to
relatively high variability (standard deviation of approx.
0.5%), since samples $S_P$ and $S_E$ include only 1,500 and
1,700 persons respectively in this region.

Net undercoverage is positive in all the domains studied. 
There is therefore no net overcoverage. The highest values 
are observed for foreigners with permanent or temporary 
permits (2.89% and 3.48%, standard deviations = 0.32%
and 0.39%) as well as for 20-31-year-olds (2.84%, standard 
deviation = 0.36%). No significant difference is observed 
between males and females, between languages or between 
NUTS regions. Because of the small size of the sample with 
the collection variant TICINO, this method cannot be 
differentiated from the others used in the country. On the 
other hand, significant differences are observed between 
marital statuses, as well as between types and sizes of 
communes.


Table 1 International comparison of overall results. Estimated rates of
overcoverage $\hat{R}_{over}$, undercoverage $\hat{R}_{under}$ and net
undercoverage $\hat{R}_{netunder}$, with corresponding estimated standard
deviations in parentheses. References: Statistics Canada (1999, 2004), Hogan
(1993, 2003), McLennan (1997) and Trewin (2003)

Country        Census  Overcoverage  Undercoverage  Net undercoverage
                       [%]           [%]            [%]
Switzerland    2000    0.35 (0.03)   1.64 (0.11)     1.41 (0.12)
Canada         1996    0.74 (0.04)   3.18 (0.09)     2.45 (0.10)
               2001    0.96 (0.05)   3.95 (0.13)     2.99 (0.14)
United States  1990    [illegible]   4.7             1.6 (0.10)
               2000    -             -              -0.5 (0.21)
Australia      1996    0.2           1.8             1.6 (0.10)
               2001    0.9           [illegible]     1.8 (0.10)


We note that the net undercoverage rate is greater than 
the undercoverage rate in the case of permanent settlement 
permits. This effect, which is unrealistic, is due to the choice 
of estimation cells and the resulting smoothing. The 
construction of the cells made it necessary to group 
foreigners with permanent and temporary permits into a 
single category for aggregates so as to obtain the minimum 
size of 100 persons per cell. By making this grouping, we 
are treating foreigners as a homogeneous group, whereas 
this is not the case. This shows the limitations of the method 
and the difficulty of satisfying the assumptions of the 
models used in applying the approach. In the case of 
foreigners, we note, however, that the confidence intervals 
of the undercoverage and net undercoverage rates overlap. 
The consequences of the weaknesses of the application are 
therefore limited. 
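
The overlap can be checked with a few lines of Python. The sketch assumes normal-approximation intervals at the 95% level (the level is not stated in the text) and uses the Table 2 estimates for foreigners with permanent permits, 1.85% (standard deviation 0.29%) for undercoverage and 2.89% (0.32%) for net undercoverage.

```python
def confidence_interval(estimate, sd, z=1.96):
    """Normal-approximation interval; z = 1.96 is an assumed 95% level."""
    return estimate - z * sd, estimate + z * sd


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Rates in percent for foreigners with permanent permits (Table 2).
under = confidence_interval(1.85, 0.29)
net_under = confidence_interval(2.89, 0.32)

overlap = intervals_overlap(under, net_under)
```

Under these assumptions the upper bound for undercoverage (about 2.42%) exceeds the lower bound for net undercoverage (about 2.26%), so the two intervals do overlap.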

It should also be noted that the results are presented in 
domains defined by variables for which low levels of 
potential measurement errors were observed. The fact is that 
results for groups as defined by household or labour market 
characteristics would not be very reliable; see Section 6.2. 

The precision of the results obtained is generally better 
than the objective set at the beginning of the project. That 
objective was to have a standard deviation of 0.3% for 
subgroups of 10,000 individuals in $S_P$. In the case of, for
example, age classes 32-44 and 45-59, which have between
10,000 and 12,000 persons, the standard deviations are
0.19% and 0.14%.


6.2 Potential measurement and classification errors 


Of the 49,107 elements matched between the coverage 
survey and the census, 96% exhibit no difference in sex, the 
seven age classes, the three marital status classes and the 
three settlement permit classes (Swiss, permanent, 
temporary). The matching rate in the good domain R aes 
99.3% for sex (with and without imputations), 98.3% for 
marital status (98.4% for non-imputed values) and 98.7% 
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for settlement permits (98.8% for non-imputed values). The 
RA rate is 99.5% for age classes (with or without impu- 
tations). However, it should be noted that date of birth, 
along with surname and given name, was one of the main 
variables in the matching. Age differences are therefore 
possible only in the case of a non-automatic (computer- 
assisted or manual) matching. Three variables exhibit a 
matching rate in the good domain that is markedly lower 
than that observed for sex, age, permit and marital status. 
These are the variables for labour market status (in the 
labour market, unemployed, not in the labour market), 
position in household (alone, spouse, common-law union, 
person with child or children, other head of household, 
related to head of household, other; results limited to private 
households), and size of the person’s household (according 
to economic residence and in private households). The $\hat{R}_X$
rate is 90.4% for labour force status (91.1% for non-imputed
values), 91.4% for position in household (94.9% for non- 
imputed values) and 88.3% for household size. 

The measure of asymmetry $\theta_X(d, d')$ takes on the value
1.33 for sex ($d$ = male and $d'$ = female). There are more
persons coded as males according to the survey who are
coded as females in the census than there are females
according to the survey who are males according to the
census. The proportion of males is slightly higher in the
survey. However, these results must be interpreted with
caution, since they are based on very few cases; see Table 3. A
McNemar test is just significant at the 5% level without
taking the design into account, but it is no longer significant
at that level if the design is factored in. On the other hand,
quite substantial asymmetries are observed for marital
status. There are fewer single persons in the survey who are
married in the census than the reverse (factor 0.33 for
$d$ = single and $d'$ = married). Similarly, there are fewer
married persons in the survey who are widowed in the
census than the reverse (factor 0.42 for $d$ = married and
$d'$ = other). Asymmetry is also observed for the settlement
permit variable. The tendency is to have more Swiss persons
in the survey who are described as foreigners in the census
than the reverse, and to have more permanent permits in the
survey and temporary permits in the census than the reverse
(factors 5.22 for $d$ = Swiss and $d'$ = foreigner with
permanent permit and 3.83 for $d$ = foreigner with
permanent permit and $d'$ = foreigner with temporary
permit). The factors calculated are based on few cases.
However, they give an insight into the potential differences
between data collection via the census questionnaire and a
survey conducted mainly by telephone. The labour force
status variable includes more divergent cases; see Table 4.
Thus, for example, we observe fewer persons who are in the
labour force in the survey and not in the labour force in the
census than the reverse (factor of 0.46 for $d$ = in labour


Survey Methodology, December 2007 


force and $d'$ = not in labour force). There are also fewer
unemployed persons in the survey and persons not in the
labour force in the census than the reverse (factor of 0.26 for
$d$ = unemployed and $d'$ = not in labour force). The
position-in-household variable also exhibits asymmetries,
but these are based on few elements, since the dispersion of
the elements in the boxes $(d, d')$ is sizable. The census
variables at the household level (position in household and
size of household) are influenced by the complex process of
household formation. They are less reliable than those
concerning persons. The values at the household level are
more reliable in the survey.


Table 2 Enumerated number $C$ and estimated rates of overcoverage $\hat{R}_{over}$, undercoverage $\hat{R}_{under}$ and net undercoverage $\hat{R}_{netunder}$ for different domains [%], with corresponding estimated standard deviations (SDs)

Variable           Category                       C       R_over           SD      R_under           SD   R_netunder           SD
Overall                                   7,121,626         0.35         0.03         1.64         0.11         1.41         0.12
Sex                Male                   3,497,940         0.37         0.04         1.74         0.13         1.46         0.13
                   Female                 3,623,686         0.33         0.03         1.55         0.10         1.39         0.13
Age class          ≤ 9                      810,373         0.26         0.05         1.46         0.21         1.34         0.26
                   10-19                    833,185         0.27         0.05         1.30         0.19         1.04         0.22
                   20-31                  1,115,804         0.93         0.09         3.50         0.34         2.84         0.36
                   32-44                  1,544,721         0.33         0.05         1.65         0.16         1.43         0.19
                   45-59                  1,431,771         0.22         0.04         1.18         0.14         1.04         0.14
                   60-79                  1,146,709         0.10         0.03         0.91         0.13         0.82         0.12
                   ≥ 80                     239,063         0.11         0.06         1.20         0.31         1.03         0.27
Settlement permit  Swiss                  5,674,266         0.33         0.03         1.28         0.09         0.98         0.10
                   Foreigner, permanent   1,020,242         0.33         0.06         1.85         0.29         2.89         0.32
                   Foreigner, temporary     427,118         0.56         0.11         8.03         0.85         3.48         0.39
Marital status     Single                 2,975,643         0.50         0.05         2.07         0.18  [illegible]         0.19
                   Married                3,377,223         0.23         0.04  [illegible]         0.11  [illegible]  [illegible]
                   Widowed                  369,339         0.25         0.08  [illegible]         0.26         0.79         0.13
                   Divorced                 399,421         0.24         0.08  [illegible]         0.35         1.02         0.10
Commune language   German + Romansh     [illegible]         0.33         0.04         1.50         0.11         1.28         0.12
                   French                 1,680,062         0.35         0.06         1.89         0.25  [illegible]         0.27
                   Italian              [illegible]         0.53         0.12         2.35         0.49         1.56         0.19
NUTS region        Région lémanique       1,296,464         0.37         0.07         2.19         0.38         1.84         0.28
                   Espace Mittelland      1,640,489         0.35         0.09  [illegible]         0.15  [illegible]         0.10
                   Nordwestschweiz          976,699         0.18         0.04         1.50         0.27         1.32         0.12
                   Zurich                 1,221,014         0.31         0.05         1.58         0.19         1.46         0.13
                   Ostschweiz             1,020,897         0.40         0.07  [illegible]         0.23         1.24  [illegible]
                   Zentralschweiz           665,904         0.36         0.06         1.57         0.25  [illegible]         0.12
                   Ticino                   300,159         0.54         0.12         2.38         0.52  [illegible]         0.19
Commune size       Small                  1,372,958         0.34         0.05         1.50         0.15  [illegible]         0.14
                   Medium                 2,398,256         0.41         0.07  [illegible]         0.16         1.07         0.19
                   Large                  3,350,412         0.31         0.03         2.01         0.19  [illegible]         0.19
Type               City/town              2,078,780         0.35         0.04         1.96         0.17         1.82         0.20
                   Agglomeration          3,145,541         0.36         0.06         1.49         0.19         1.34  [illegible]
                   Rural                  1,897,305         0.32         0.04         1.56         0.17         1.07         0.12
Collection method  TRADITIONAL              265,607         0.39         0.05         1.91         0.28         1.07         0.12
                   SEMI-TRADITIONAL         174,501         0.37         0.08         1.07         0.24         1.16         0.13
                   TRANSIT + FUTURE       6,381,359         0.33         0.03         1.62         0.11         1.42         0.12
                   TICINO                   300,159         0.54         0.12         2.38         0.52  [illegible]         0.19


Table 3 Comparison of values collected in the survey and the census for the sex variable

                                      Survey
Census                                Male      Female       Total
Not matched        Total               393         383         776
Matched            Total            24,171      24,936      49,107
Matched            Male             23,967         166      24,133
                   Female              204      24,770      24,974
Matched            Male                  6           0           6
(imputed value)    Female                0          13          13
Total                               24,564      25,319      49,883
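
The unweighted McNemar comparison mentioned for the sex variable can be reproduced approximately from Table 3, reading the discordant matched counts as 166 (male in the census, female in the survey) and 204 (female in the census, male in the survey). The sketch assumes the uncorrected form of the test and ignores both the weights and the sampling design, so it corresponds only to the analysis "without taking the design into account"; whether the original analysis used a continuity correction is not stated.

```python
import math


def mcnemar(b, c):
    """Uncorrected McNemar statistic and its chi-square (1 df) p-value.

    b, c: the two discordant counts of a paired 2 x 2 table.
    """
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = mcnemar(166, 204)
```

The statistic is 1444/370, about 3.90, with a p-value of roughly 0.048, which is consistent with a result that is only just significant at the 5% level.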


Table 4 Comparison of values collected in the survey and the census for the labour force status variable

                                               Survey
Census                                         Employed   Unemployed   Not in labour force   < 15 years of age    Total
Not matched              Total                      424           23                   217                 112      776
Matched                  Total                   25,163          498                14,501               8,945   49,107
Matched                  Employed                23,953          188                 2,007                  13   26,161
                         Unemployed                 300          221                   323                   1      845
                         Not in labour force        901           89                12,143                  18   13,151
                         < 15 years of age            9            0                    28               8,913    8,950
Matched (imputed value)  Employed                   564  [illegible]           [illegible]                   6      904
                         Unemployed                  14            8                    26                   1       49
                         Not in labour force         92           15                   881                   5      993
                         < 15 years of age            0            0                     0                   0        0
Total                                            25,587          521                14,718               9,057   49,883


6.3 Potential location and time lag errors 


Of the 49,107 elements matched between the coverage 
survey and the census, 97.7% are found within the basic 
area around the reference address collected in the survey. 
The corresponding value is 98.1% for persons who did not 
indicate any move between Census Day and the day of the 
survey. It is 83.9% for those who indicated a move (1,512 
persons); see absolute numbers in Table 5. 

It is worth noting that 9.4% of the persons in NORD who 
did not move are found close to their reference address but 
not in exactly the same building. While these problems of 
exact location have a negligible effect on the census data, 
they show the difficulty of identifying the buildings sampled 
when constructing lists of households in the field during the 
survey, as well as the difficulty of assigning persons to 
buildings during the processing of the census data. How- 
ever, a supplementary survey would be needed to evaluate 
the respective effects of these two difficulties.

Efforts to locate persons who moved indicate that
151 = 145 + 6 persons were located near their address
reported on the day of the survey and not near their Census 
Day address (9%, weighted). Also, a set of 688 persons in 
NORD, among the 922 located in the two basic areas, were 
actually found to be residing in the building on the day of 
the survey. During the coverage survey, special care was 
taken regarding questions on addresses on Census Day and 
on the day of the survey. We therefore believe that the 
addresses of persons who moved are of better quality in the 
survey data than in the census data. On this basis, we deduce 
that out of the 1,512 persons who moved, at least 
151 + 688 = 839 are enumerated in the census at an address 
that they did not have on the official day of data collection 
but at an address that they had some time after that date. The 
exact time lag is not known, since the moving date was not 
collected in the survey.


Table 5 Comparison of the location of matched persons. The areas are defined for the address on Census Day (according to information
collected in the survey) and for the address on the day of the survey (also according to information collected in the survey).
Presence in the basic area, the extended area (outside the basic area) or outside the extended area for persons who did not move
(stayed) and persons who moved (moved) between the census and the survey

                              Day of survey
Census Day                    Stayed
Basic area                    [illegible]
Extended area                 258
Outside extended area         648
Missing                       0
Total                         47,595


7. Conclusion 


Overall coverage deficiencies in the 2000 census of 
population in Switzerland are of the same order of 
magnitude as those for the censuses of other countries. 
However, differences are noted for subgroups (e.g., 
regions). Of the three components, undercoverage is of great 
interest, since it not only serves to detect groups of persons 
not well enumerated, but it also lends itself to analysing 
location and measurement errors. As to overcoverage 
estimates, these are limited by the lack of information 
supplementary to the census for S,. In the future, they 
could be improved by collecting supplementary information 
on characteristics reported on Census Day in a survey of 
persons in that sample (e.g., location and household type). 
Net undercoverage estimates are based on several assump- 
tions. The results in large domains seem reliable, but certain 
risks, notably related to the choice of estimation cells, exist 
when domains are smaller. For future estimates, we propose 
to evaluate the model approach applied in the United 
Kingdom instead of the estimation cells traditionally used in 
the United States. 

An important element to review for future estimates is 
the choice of the population of interest. The decision to limit 
that population to persons in private households and the 
economic residence led to a few problems in the estimates, 
since it was difficult to delimit that population precisely. In 
a future estimation, collective households could be excluded 
so as to avoid practical problems relating to collection but 
retain all types of residences. The set of records for the 
economic residence would then be treated as a domain. 

Estimating the coverage deficiencies of a census is an 
ambitious project that has proved to be worthwhile. The 
results provide information on the quality of the data from 
the 2000 census and the different coverage problems. 
Upcoming censuses will essentially be based on registers. 
Coverage estimates will be based on the experience 
acquired in making the 2000 estimates, with probable 
adaptations to take account of the new data collection 
system. 
Use of a web-based convenience sample to supplement 
a probability sample 


Marc N. Elliott and Amelia Haviland 


Abstract 


In this paper we describe a methodology for combining a convenience sample with a probability sample in order to produce 
an estimator with a smaller mean squared error (MSE) than estimators based on only the probability sample. We then 
explore the properties of the resulting composite estimator, a linear combination of the convenience and probability sample 
estimators with weights that are a function of bias. We discuss the estimator’s properties in the context of web-based 
convenience sampling. Our analysis demonstrates that the use of a convenience sample to supplement a probability sample 
for improvements in the MSE of estimation may be practical only under limited circumstances. First, the remaining bias of 
the estimator based on the convenience sample must be quite small, equivalent to no more than 0.1 of the outcome’s 
population standard deviation. For a dichotomous outcome, this implies a bias of no more than five percentage points at 50 
percent prevalence and no more than three percentage points at 10 percent prevalence. Second, the probability sample 
should contain at least 1,000-10,000 observations for adequate estimation of the bias of the convenience sample estimator. 
Third, it must be inexpensive and feasible to collect at least thousands (and probably tens of thousands) of web-based 
convenience observations. The conclusions about the limited usefulness of convenience samples with estimator bias of 
more than 0.1 standard deviations also apply to direct use of estimators based on that sample. 


Key Words: Bias; Composite estimator; Calibration. 


1. Introduction 


Web-based surveys have steadily increased in use and 
take a variety of forms (Couper 2000). For instance, web- 
based probability samples use a traditional sampling frame 
and provide web-mode as one response option or the only 
response option. Web-based probability samples can have 
high response rates and produce estimators with minimal 
non-response bias (Kypri, Stephenson and Langley 2004). 
In contrast, web-based convenience samples are based on 
“inbound” hits to web pages obtained from anyone online 
who finds the site and chooses to participate (sometimes as 
a result of advertising to a population that is not specifiable) 
or based on volunteerism from recruited panels that are not 
necessarily representative of the intended population. 

The primary appeal of web-based convenience samples 
lies in the potentially very low marginal cost per case. Visits 
to a web site do not require expensive labor (as for phone 
calls) or materials (as for mailings) for each case, combined 
with rapid data collection and reductions in marginal data 
processing costs per case. Even with some fixed costs, the 
total costs per case are potentially very low, especially for 
large surveys. The disadvantage of these samples is also 
clear: potentially large and unmeasured selection bias. 

Most discussions of web-based convenience samples of 
which we are aware have either argued that probability 
samples are unimportant in general, tried to delineate the 
circumstances under which convenience samples may be 
useful, or dismissed the use of convenience samples 
entirely. We explore a different avenue by investigating the 
possibility of integrating web-based convenience samples 
into the context of probability sampling. 

In this paper we describe a methodology for combining a 
convenience sample with a probability sample to produce an 
estimator with a smaller mean squared error (MSE) than 
estimators that employ only the probability sample. We then 
explore the properties of the resulting composite estimator, a 
linear combination of the convenience and probability 
samples with weights determined by bias. This leads to 
recommendations regarding the usefulness of supple- 
menting probability samples with web-based convenience 
samples. Because the marginal costs of web-based con- 
venience samples are very low, we focus on identifying 
situations in which the increase in effective sample size 
(ESS) attributable to the inclusion of the convenience 
sample may be sufficient to justify a dual-mode approach. 
We demonstrate that there are limited circumstances under 
which a supplemental web-based convenience sample may 
meaningfully improve MSE. While we focus on web-based 
convenience samples, the discussion that follows applies to 
other low-cost data collection methods with poor population 
coverage. 


2. Problem context 


2.1 Initial conditions 


For the combined probability/convenience sample, we 
propose that the same survey be administered simul- 
taneously to a traditional probability sample (with or 
without a web-based response mode) and a web-based 
convenience sample. We envision a multi-purpose survey 
with a number of survey outcomes. In this paper we will 
focus on the estimation of means, but future work might 
extend these results to other parameters, such as regression 
parameters. Although we will initially consider cases where 
the bias of convenience sample estimates is known, we will 
later consider the extent to which the probability sample 
provides a means of measuring the unknown bias in each 
parameter estimate from the convenience sample. 

With known bias, one may combine the convenience and 
probability samples in a manner that minimizes MSE. If 
estimates from the convenience sample are very biased, the 
convenience sample will accomplish little. This possibility 
requires that the probability sample be large enough to stand 
on its own. Thus, one approach would be to set aside a small 
portion of the probability sample budget to create a large 
convenience sample supplement. 

For example, consider a survey for which the primary 
interest is in estimates for the population as a whole, but for 
which subpopulation estimates would also be desirable if a 
sample size supporting adequate precision were affordable. 
Suppose further that one could draw 4,000 probability 
observations and 10,000 convenience observations for the 
cost of a probability sample of 5,000. For a given 
outcome, if bias is large, standard errors increase moderately 
through a small proportionate loss in sample size; if bias is 
small overall and within each subpopulation, there might be 
a “precision windfall,” allowing acceptably precise sub- 
population analyses. 
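
As a rough check on the "moderate" increase in standard errors, and assuming (a simplification not stated in the text) simple random sampling, so that standard errors scale with the inverse square root of the sample size, the worst case in which the convenience sample contributes nothing gives:

```latex
\[
\frac{\mathrm{SE}(n = 4{,}000)}{\mathrm{SE}(n = 5{,}000)}
  = \sqrt{\frac{5{,}000}{4{,}000}} \approx 1.118 ,
\]
```

that is, standard errors roughly 12% larger than those from the full probability sample of 5,000.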


2.2 Initial bias reduction 


We will demonstrate that the bias of convenience sample 
estimators must be quite small for the sample to be useful, 
suggesting that it may be best to focus on estimating 
parameters that are typically subject to less bias than overall 
unadjusted population estimates of proportions or means, 
such as regression coefficients (Kish 1985). 

Additionally, one might reduce bias by calibrating the 
convenience sample to known population values (Kalton 
and Kasprzyk 1986) or by applying propensity score 
weights that model membership selection between the two 
samples to observations from the convenience sample 
(Rosenbaum 2002; Rosenbaum and Rubin 1983). A small 
set of items can be included to allow the use of either 
approach. These items might include both items that predict 
differences between respondents to web surveys and other 
survey modes, as well as items tailored to the content of the 
particular survey. The design effect from the resulting 
variable weights will reduce the ESS for convenience 
sample estimators, but the low cost of these observations 
makes compensating for moderate design effects affordable. 
We then can estimate the remaining bias for a given 
parameter as the difference between the estimate in the 
probability sample and the weighted estimate in the 
convenience sample. 
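
The following minimal sketch illustrates this step. It assumes the weighting design effect is approximated by Kish's formula, ESS = (sum of weights)^2 / (sum of squared weights), which is a common approximation and is not specified in the text. The function names are hypothetical, and the sign convention (convenience estimate minus probability estimate, so that a positive value means the convenience sample overestimates) is a choice made for illustration.

```python
def kish_effective_sample_size(weights):
    """Approximate effective sample size under unequal weights (Kish's formula)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def weighted_mean(values, weights):
    """Weighted mean of `values` using `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def estimated_remaining_bias(prob_values, conv_values, conv_weights):
    """Estimate the bias left in the weighted convenience-sample mean.

    The probability-sample mean is treated as unbiased (equal weights are
    assumed here for simplicity); the result is the weighted convenience
    mean minus the probability-sample mean.
    """
    prob_mean = sum(prob_values) / len(prob_values)
    return weighted_mean(conv_values, conv_weights) - prob_mean
```

Dividing the estimated bias by an estimate of the outcome's standard deviation gives the standardized bias used in the sections that follow.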


3. Efficiency considerations 


3.1 Linear combinations of biased and unbiased 
estimators of a population mean 


The most efficient estimator that is a linear combination 
of the (weighted) convenience and probability samples is a 
special case of an estimator given in a result by Rao (2003, 
pages 57-58). The properties of this estimator lead to 
general recommendations regarding the conditions of 
probability sample size, convenience sample size, and 
convenience sample estimator bias under which the 
convenience sample meaningfully improves the ESS of the 
probability sample. 

We begin by asking: What is the most efficient estimator 
of this form when the magnitude of the bias is known? We 
will later consider relaxing the assumption of known 
magnitude of the bias. 

Let $n_1$ and $n_2$ be the effective sample sizes of the 
probability and convenience samples, respectively, 
after dividing nominal sample sizes by design effects 
associated with the sample design and non-response 
adjustments. This includes propensity score or other 
weighting in the case of the convenience sample. The 
former population has mean $\mu$ and variance $\sigma_1^2$; the latter 
has mean $\mu + \varepsilon$ and variance $\sigma_2^2$, where $\varepsilon$ is the known 
bias remaining after weighting and $\mu$ is the unknown 
parameter of interest. The corresponding sample means, $\bar{x}_1$ and $\bar{x}_2$, 
have expectation $\mu$ and $\mu + \varepsilon$ and variance $\sigma_i^2/n_i$ for 
$i = 1, 2$ under an infinite population sampling model. We 
assume these two estimators are uncorrelated, as they come 
from independent samples. 

From Rao (2003, pages 57-58), the most efficient 
composite estimator of $\mu$ takes the form 

$$\hat{\mu} = \frac{\bar{x}_2\,(\sigma_1^2/n_1) + \bar{x}_1\,(\varepsilon^2 + \sigma_2^2/n_2)}{\varepsilon^2 + \sigma_1^2/n_1 + \sigma_2^2/n_2}$$

with remaining bias 

$$\mathrm{bias} = \varepsilon\,\frac{\sigma_1^2/n_1}{\varepsilon^2 + \sigma_1^2/n_1 + \sigma_2^2/n_2}$$

and 

$$\mathrm{MSE}_c = \frac{(\sigma_1^2/n_1)(\varepsilon^2 + \sigma_2^2/n_2)}{\varepsilon^2 + \sigma_1^2/n_1 + \sigma_2^2/n_2}.$$
As can be seen, the composite estimator is a convex 
combination of the convenience sample and probability 
sample means. The influence of the former is determined by 
the ratio of the MSE (here variance) of the probability 
sample mean to the sum of that term and the MSE of the 
convenience sample mean. Similarly, the remaining bias is 
the original bias multiplied by this same ratio, whereas the 
resultant $\mathrm{MSE}_c$ is the product of the two MSEs divided by 
their sum. Note that bias approaches zero both as $\varepsilon \to 0$ (no 
selection bias in the convenience sample estimate) and as 
$\varepsilon \to \infty$ (no weight given to the convenience sample). 
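
The verbal description above can be sketched in a few lines of Python. This is an illustrative sketch under the paper's assumptions (known bias, independent samples, effective sample sizes already adjusted for design effects); the function and field names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CompositeResult:
    weight_convenience: float  # weight given to the convenience sample mean
    estimate: float            # composite estimate of the population mean
    bias: float                # remaining bias of the composite estimate
    mse: float                 # mean squared error of the composite estimate


def composite_estimate(mean_prob, mean_conv, var_prob, var_conv,
                       n_prob, n_conv, bias):
    """Combine an unbiased probability-sample mean with a biased
    convenience-sample mean, assuming the bias is known."""
    mse_prob = var_prob / n_prob               # variance of the unbiased mean
    mse_conv = bias ** 2 + var_conv / n_conv   # squared bias plus variance
    # Weight on the convenience mean: MSE of the probability mean divided by
    # the sum of the two MSEs.
    w = mse_prob / (mse_prob + mse_conv)
    return CompositeResult(
        weight_convenience=w,
        estimate=(1.0 - w) * mean_prob + w * mean_conv,
        bias=w * bias,
        mse=mse_prob * mse_conv / (mse_prob + mse_conv),
    )
```

For instance, `composite_estimate(0.50, 0.52, 0.25, 0.25, 1000, 10000, 0.02)` uses made-up inputs for a proportion near 50%; as the assumed bias grows, the weight on the convenience mean shrinks toward zero.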


3.2 Quantifying the contributions of the 
convenience sample 


We now can evaluate the contributions of the 
convenience sample based on the known remaining bias in 
its associated estimators. To this end, we will define several 
quantities. 

Let 

$$\mathrm{ESS}_1 = n_1\,\frac{\sigma_1^2/n_1}{\mathrm{MSE}_c} = n_1\,\frac{\varepsilon^2 + \sigma_1^2/n_1 + \sigma_2^2/n_2}{\varepsilon^2 + \sigma_2^2/n_2}$$

be the effective sample size needed for an unbiased sample 
mean with the same MSE as the composite estimator. To 
further simplify this expression, let us define the remaining 
standardized bias, $E = \varepsilon/\sigma$, and consider the case in which 
the observations from the convenience and probability 
populations have equal variance ($\sigma_1 = \sigma_2 = \sigma$). In this 
case, the increment to $\mathrm{ESS}_1$ attributable to the convenience 
sample, the difference between $\mathrm{ESS}_1$ with and without the 
convenience sample, is 

$$\frac{1}{1/n_2 + E^2} = n_2\left(\frac{1}{1 + n_2E^2}\right).$$

3.3 Maximum contribution of the convenience 
sample 


As $n_2 \to \infty$, the increment to $\mathrm{ESS}_1$ approaches $1/E^2$. 
This limit, the inverse of the squared standardized bias, is 
the maximum possible incremental contribution of the 
convenience sample to the $\mathrm{ESS}_1$ (abbreviated MICCS). If 
the MICCS is small, then a convenience sample of any size 
cannot meaningfully improve MSE. If the MICCS is large 
enough to be meaningful, we then need to consider what 
convenience sample sizes are needed to achieve a large 
proportion of the MICCS. 

To develop intuition for the magnitude of $E$ (standardized 
bias) we consider the important case of a dichotomous 
outcome, for which $E = \varepsilon/\sqrt{P(1-P)}$, where $P$ is the 
population probability of the outcome. Table 1 below 
translates bias for a dichotomous outcome from percentage 
points to standardized bias and then to the corresponding 
MICCS for $P = 0.1$ and $P = 0.5$. 
Table 1 Maximum contributions of convenience samples to the estimation of a proportion by bias in percentage points 

E                   Bias in percentage points,          MICCS (b) 
(standardized       by overall prevalence of outcome 
bias (a))           10%           50% 
0.01                0.3%          0.5%                  10,000 
0.02                0.6%          1.0%                   2,500 
0.05                1.5%          2.5%                     400 
0.10                3.0%          5.0%                     100 
0.20                6.0%         10.0%                      25 

(a) Of estimators of means using only the convenience sample. 
(b) ESS added with an infinitely large convenience sample relative to 
    no use of a convenience sample. 
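
The entries in Table 1 follow directly from the two formulas above, as this short sketch shows. The function names are hypothetical; the formulas are $\varepsilon = E\sqrt{P(1-P)}$ and MICCS $= 1/E^2$.

```python
import math


def bias_in_percentage_points(std_bias, prevalence):
    """Convert a standardized bias into percentage points for a dichotomous
    outcome with the given population prevalence."""
    sd = math.sqrt(prevalence * (1.0 - prevalence))
    return 100.0 * std_bias * sd


def miccs(std_bias):
    """Maximum possible ESS increment from a convenience sample of any size."""
    return 1.0 / std_bias ** 2


# Print one row per standardized bias value used in Table 1.
for std_bias in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(f"{std_bias:4.2f}  "
          f"{bias_in_percentage_points(std_bias, 0.10):4.1f}%  "
          f"{bias_in_percentage_points(std_bias, 0.50):4.1f}%  "
          f"{miccs(std_bias):8,.0f}")
```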


For a proportion near 50%, a bias of 2.5 percentage 
points limits the potential increment of $\mathrm{ESS}_1$ to 400. The 
minimum increment to $\mathrm{ESS}_1$ that offsets the fixed cost of 
setting up the web-based response mode will vary by user, 
but we suspect increments of less than 100 will rarely be 
cost-effective. Table 1 then implies that convenience 
samples for which the standardized biases of estimators 
restricted to the convenience sample generally exceed 0.1 
standard deviations will rarely prove cost-effective. For a 
dichotomous variable with P between 0.1 and 0.5 this 
corresponds to a bias of 3 to 5 percentage points. 

How easily are biases of this magnitude achieved with 
adjusted estimates from convenience samples? Several 
studies compared propensity-weighted web-based conve- 
nience samples to RDD surveys. One (Taylor 2000) 
advocated the stand-alone use of such convenience samples 
despite differences of as much as five percentage points in a 
number of estimates for dichotomous outcomes regarding 
political attitudes, with standardized bias of 0.05 to 0.10 if 
one treats RDD as a gold standard. Another (Schonlau, 
Zapert, Simon, Sanstad, Marcus, Adams, Spranca, Kan, 
Turner and Berry 2003) does not report magnitudes of 
differences, but does report that 29 of 37 items regarding 
health concerns exhibit differences that are statistically 
significant at p<0.01. Given the reported sample sizes 
(and optimistically ignoring any DEFF from weighting), it 
can be shown that significance at that threshold implies 
point estimates of standardized bias exceeding 0.05 for 
estimators of 78% of items. The key outcome in a Slovenian 
comparison of a probability phone sample and a Web-based 
convenience sample (Vehovar, Manfreda and Batagelj 
1999) would be estimated with a standardized bias of more 
than 0.1 from the convenience sample even after extensive 
weighting adjustments. It should be noted that there may 
also be mode effects on responses for the Web mode when 
compared to a telephone mode among subjects randomized 
to response mode (Fricker, Galesic, Tourangeau and Yan 
2005), so that not all differences between Web convenience 
samples and non-Web probability samples may result from 
selection. 
3.4 Actual contribution of the convenience sample 


While the maximum possible increment (MICCS) is 
$1/E^2$, the actual increment to $\mathrm{ESS}_1$ can be expressed as 
$[k/(k+1)]\,\mathrm{MICCS}$, where $k = n_2E^2$. The shortfall of the 
actual increment to $\mathrm{ESS}_1$ from the MICCS can then be 
expressed as $\mathrm{MICCS} - (\mathrm{ESS}_1 - n_1) = 1/[E^2(1 + n_2E^2)]$. This 
implies that the returns to $\mathrm{ESS}_1$ diminish with increasing 
size of the convenience sample, more quickly with large 
bias since the bias eventually dominates any further 
variance reduction. Half of the MICCS noted is achieved 
when the ESS of the convenience sample is equal to the 
MICCS. For example, if bias is 0.01 standard deviations and 
a convenience sample has an ESS of 10,000, then the 
MICCS is 10,000, but the actual incremental contribution to 
$\mathrm{ESS}_1$ will be 5,000. This suggests that convenience samples 
with ESS 2-20 times as large as MICCS will suffice for 
most purposes, which corresponds to 67%-95% of the 
potential gain in ESS. Such heuristics in turn imply 
collecting 200 - 4,000 such cases when $E$ is relatively large 
($E = 0.05$ to 0.10) and 5,000 - 200,000 such cases when $E$ 
is relatively small ($E = 0.01$ to 0.02). Table 2 provides 
illustrative examples of the $\mathrm{ESS}_1$ achieved at several 
combinations of sample sizes and bias. 
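
The fraction of the MICCS that is realized can be derived in one line from the increment $1/(E^2 + 1/n_2)$, writing $k = n_2E^2 = n_2/\mathrm{MICCS}$:

```latex
\[
\frac{1}{E^2 + 1/n_2}
  = \frac{1}{E^2}\cdot\frac{n_2E^2}{1 + n_2E^2}
  = \frac{k}{k+1}\,\mathrm{MICCS},
\qquad
\frac{k}{k+1} =
\begin{cases}
1/2 & k = 1,\\
2/3 \approx 67\% & k = 2,\\
20/21 \approx 95\% & k = 20.
\end{cases}
\]
```

So a convenience sample whose ESS equals the MICCS recovers half of the possible gain, and one 2 to 20 times as large recovers roughly 67% to 95% of it.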


Table 2 Examples of $\mathrm{ESS}_1$ at several sample sizes and levels of standardized bias 

n_1             n_2              E                 ESS_1 for the     ESS_1 / n_1 (b) 
(Probability    (Convenience     (Standardized     Composite 
Sample Size)    Sample Size)     Bias (a))         Estimate 
 1,000            1,000          0.01               1,909             1.909 
 1,000            1,000          0.10               1,091             1.091 
 1,000           10,000          0.01               6,000             6.000 
 1,000           10,000          0.10               1,099             1.099 
 1,000          100,000          0.01              10,091            10.091 
 1,000          100,000          0.10               1,100             1.100 
10,000            1,000          0.01              10,909             1.091 
10,000            1,000          0.10              10,091             1.009 
10,000           10,000          0.01              15,000             1.500 
10,000           10,000          0.10              10,099             1.010 
10,000          100,000          0.01              19,091             1.909 
10,000          100,000          0.10              10,100             1.010 

(a) Of estimators of means using only the convenience sample. 
(b) ESS relative to no use of a convenience sample. 
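
The rows of Table 2 can be recomputed with a short sketch, assuming equal variances in the two populations so that the increment to the ESS is $1/(1/n_2 + E^2)$. The function names are hypothetical.

```python
def ess_increment(n_conv, std_bias):
    """ESS added to the probability sample by a convenience sample of
    effective size n_conv with known standardized bias std_bias."""
    return 1.0 / (1.0 / n_conv + std_bias ** 2)


def ess_composite(n_prob, n_conv, std_bias):
    """ESS of the composite estimator (ESS_1 in the text)."""
    return n_prob + ess_increment(n_conv, std_bias)


# Loop over the same sample sizes and bias levels as Table 2.
for n_prob in (1_000, 10_000):
    for n_conv in (1_000, 10_000, 100_000):
        for std_bias in (0.01, 0.10):
            ess1 = ess_composite(n_prob, n_conv, std_bias)
            print(f"{n_prob:>7,} {n_conv:>8,} {std_bias:5.2f} "
                  f"{ess1:>8,.0f} {ess1 / n_prob:7.3f}")
```

With a standardized bias of 0.10, even a hundredfold increase in the convenience sample adds fewer than 10 effective observations, which is the diminishing-returns pattern described in the text.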


3.5 Precision for estimating bias 


Heretofore, we have assumed a known bias in 
convenience sample estimators; in practice, the bias will 
need to be estimated using information from both samples. 
We next explore the extent to which the size of the 
probability sample also constrains the usefulness of the 
convenience sample through the need to precisely estimate 
the remaining bias. 

We can estimate $\varepsilon$ as the difference between the sample 
mean of the probability sample and the weighted mean of 
the convenience sample. The true standard error for the 
estimate of bias is $\sigma_{\varepsilon} = \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$. If $\sigma_1 = \sigma_2 = \sigma$, 
the true standard error for the estimate of standardized bias 
($E$) is $\sigma_E = \sqrt{1/n_1 + 1/n_2}$. No matter how large the 
convenience sample, this term can never be less than the 
inverse of the square root of the probability sample size. 

It has been demonstrated that the relative error in MSE 
for a composite estimator is relatively insensitive to small 
errors in the estimates of bias (Schaible 1978), which is 
encouraging for well-estimated biases. Unfortunately, unless 
both the probability and convenience ESS are large, the 
standard error of the estimate for $E$ is impractically large 
relative to the values of $E$ that make the convenience 
supplement useful ($E < 0.10$). For example, suppose that a 
probability sample of ESS 1,000 and a convenience sample 
of ESS 5,000 yielded a point estimate of standardized bias 
of 0.02. If the point estimate were correct, the convenience 
sample would increase the $\mathrm{ESS}_1$ by 1,667. But this estimate 
could also have a true bias of 0.088 standard deviations 
(95% upper confidence limit), which would imply that the 
increment would be less than 130. 
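
The numbers in this example can be checked with the sketch below. It assumes a normal approximation with the multiplier 1.96 for the 95% upper limit, which matches the reported 0.088 but is not stated explicitly in the text; the function names are hypothetical.

```python
import math


def se_standardized_bias(n_prob, n_conv):
    """Standard error of the estimated standardized bias, equal variances."""
    return math.sqrt(1.0 / n_prob + 1.0 / n_conv)


def ess_increment(n_conv, std_bias):
    """ESS added by the convenience sample for a given standardized bias."""
    return 1.0 / (1.0 / n_conv + std_bias ** 2)


n_prob, n_conv, e_hat = 1_000, 5_000, 0.02

se = se_standardized_bias(n_prob, n_conv)   # standard error of the bias estimate
upper = e_hat + 1.96 * se                   # assumed 95% upper confidence limit

print(f"standard error of E:      {se:.4f}")
print(f"upper limit for E:        {upper:.3f}")
print(f"increment at estimate:    {ess_increment(n_conv, e_hat):,.0f}")
print(f"increment at upper limit: {ess_increment(n_conv, upper):,.0f}")
```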

If we assume that the convenience sample size will 
always be at least twice the probability sample size, these 
results imply that practical applications of this technique 
must have a minimum sample size of 1,000-10,000 for the 
probability sample if they are to address the uncertainty in 
the magnitude of bias in convenience sample estimators 
(standard errors of $E$ in the 0.01 to 0.04 range). 
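
The quoted range of standard errors can be verified from $\sigma_E = \sqrt{1/n_1 + 1/n_2}$. With $n_2 \ge 2n_1$ the standard error lies between $\sqrt{1/n_1}$ (as $n_2 \to \infty$) and $\sqrt{3/(2n_1)}$ (at $n_2 = 2n_1$):

```latex
\[
n_1 = 1{,}000:\quad 0.032 \le \sigma_E \le \sqrt{\tfrac{3}{2{,}000}} \approx 0.039,
\qquad
n_1 = 10{,}000:\quad 0.010 \le \sigma_E \le \sqrt{\tfrac{3}{20{,}000}} \approx 0.012 .
\]
```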


4. Discussion 


We describe a composite estimator that is a linear 
combination of an unbiased sample mean estimate from a 
probability sample and a biased (propensity-score weighted) 
sample mean estimate from a web-based convenience 
sample. We use the MSE of this composite estimator to 
characterize the contributions of the convenience sample to 
an estimator based only on the probability sample in terms 
of ESS. We then calculate the maximal contribution of the 
convenience sample, the role of the convenience sample 
size in approaching this limit, and the roles of both sample 
sizes in estimating bias with sufficient precision. 

Practitioners sometimes assume that small probability 
samples are sufficient to estimate the bias in estimates from 
corresponding convenience samples. Our results suggest 
otherwise. We demonstrate that the standardized bias of 
web-based convenience sample estimators after initial 
adjustments to reduce bias must be quite small (no more 
than 0.1 standard deviations, and probably less than 0.05 
standard deviations) for the MSE of the overall estimate to 
be meaningfully smaller than it would be without use of the 
convenience sample. We further demonstrate that 
convenience sample sizes of thousands or tens of thousands 
are also needed to realize practical gains. Finally, we 
demonstrate that a large probability sample size (1,000- 
10,000) is also needed for reasonably precise estimates of 
the remaining bias in initially bias-adjusted convenience 
sample estimators. Because the bias of estimates in an 
application to a multipurpose survey is likely to vary by 
outcome, the global decision to substitute a large number of 
inexpensive surveys for fewer traditional surveys must be 
made carefully. 

The greatest opportunity in cost savings may be in large 
surveys, simply as a function of their size. On the other 
hand, the greatest proportionate gains in precision are likely 
to occur for samples of intermediate size. Gains might also 
be substantial for large samples in which the main 
inferences concern smaller subgroups. For example, a national 
survey of 100,000 individuals might make inference to 200 
geographic subregions, with samples of 500 for each. If one 
supplemented this national sample with a very large web- 
based convenience sample, estimated the bias nationally, 
and elected to assume that the bias did not vary regionally, 
one might decrease the MSE of the sub-region estimates 
substantially through the use of such a composite estimator. 

As a final caveat, the conclusions about the limited 
usefulness of convenience samples with estimator bias of 
more than 0.1 standard deviations are not limited to attempts 
to use a composite estimator. The same approach can be 
applied to show that an estimator based only on a 
convenience sample of any size with a standardized bias of 
0.2 (e.g., ten percentage points for a dichotomous variable 
with P=0.5) will have an MSE greater than or equal to 
that of an estimate from a probability sample of size 25. 
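
The bound can be written out as follows, assuming as before that the convenience and probability populations share the variance $\sigma^2$, so that $\varepsilon = E\sigma$. For the convenience sample mean $\bar{x}_2$ alone:

```latex
\[
\mathrm{MSE}(\bar{x}_2)
  = \varepsilon^2 + \frac{\sigma^2}{n_2}
  \;\ge\; \varepsilon^2
  = E^2\sigma^2
  = \frac{\sigma^2}{1/E^2},
\qquad
E = 0.2 \;\Rightarrow\; \mathrm{MSE}(\bar{x}_2) \ge \frac{\sigma^2}{25},
\]
```

which is the variance of an unbiased mean from a probability sample with an effective size of 25, however large $n_2$ becomes.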